Programming Collective Intelligence (Chinese Edition) --- Chapter 2

English-Chinese Glossary

English                            Chinese
clustering                         聚類
collective intelligence            集體智慧
computationally intensive          計算量很大的
crawl                              (網頁)檢索
cross-product                      叉乘
data-intensive                     數據量很大的
dendrogram                         樹狀圖
dot-product                        點積
groups                             羣組
inbound link, incoming link        外部回指鏈接
kernel methods, kernel tricks      核方法,核技法
K-Means                            K-均值
k-nearest neighbors                k-最近鄰
list comprehension                 列表推導式
multidimensional scaling           多維縮放
observation                        觀測數據,觀測值
pattern                            模式
similarity                         相似度,相似性
solution                           (題)解
vertical search engine             垂直搜索引擎

Preface

The goal of this book is to take you beyond simple database-backed applications and show you how to write smarter programs that take advantage of the information you and others collect every day.

Chapter 2

  • Collecting preferences
    The first thing we need is a way to represent different people and their preferences.

    In Python, a very simple way to do this is to use a nested dictionary. If you intend to run the examples in this section, create a file named recommendation.py (the name the import statements below assume) and add the following code to build the dataset:

critics = {'Lisa Rose':{'Lady in the Water':2.5, 'Snake on a Plane':3.5,
            'Just My Luck':3.0, 'Superman Returns':3.5, 'You, Me and Dupree':2.5,
            'The Night Listener':3.0},
            
            'Gene Seymour':{'Lady in the Water':3.0, 'Snake on a Plane':3.5,
            'Just My Luck':1.5, 'Superman Returns':5.0, 'The Night Listener':3.0,
            'You, Me and Dupree':3.5,},
            
            'Michael Phillips':{'Lady in the Water':2.5, 'Snake on a Plane':3.0,
            'Superman Returns':3.5, 'The Night Listener':4.0},
            
            'Claudia Puig':{'Snake on a Plane':3.5,'Just My Luck':3.0,
            'The Night Listener':4.5, 'Superman Returns':4.0, 
            'You, Me and Dupree':2.5},
            
            'Mick LaSalle':{'Lady in the Water':3.0, 'Snake on a Plane':4.0,
            'Just My Luck':2.0, 'Superman Returns':3.0, 'The Night Listener':3.0,
            'You, Me and Dupree':2.0},
            
            'Jack Matthews':{'Lady in the Water':3.0, 'Snake on a Plane':4.0,
            'The Night Listener':3.0, 'Superman Returns':5.0, 'You, Me and Dupree':3.5},
            
            'Toby':{'Snake on a Plane':4.5, 'You, Me and Dupree':1.0, 'Superman Returns':4.0}}
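
A quick interactive check that the dataset is wired up correctly (assuming recommendation.py is on your path):

>>> import recommendation
>>> recommendation.critics['Lisa Rose']['Snake on a Plane']
3.5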

  • Finding similar users
    After collecting preference data, we need a way to determine how similar people are in their tastes. To do this, we compare each person with every other person and compute a similarity score. There are several ways to do this; in this section we will cover two systems for computing similarity scores: Euclidean distance and the Pearson correlation.

Euclidean distance

This method takes the items that people have rated in common, uses them as axes for a chart, plots the people on the chart, and looks at how far apart they are.

The chart shows the people positioned in "preference space"; Toby, for example, sits at 4.5 on the Snakes axis and 1.0 on the Dupree axis. The closer two people are in preference space, the more similar their preferences. Because the chart is two-dimensional, you can only look at two rankings at a time, but the principle works just the same for rankings over many more items.

Having computed the distance, note that it is smaller for people whose preferences are more similar. We still want a function that gives higher values for people who are more similar, though. We can do this by adding 1 to the distance (which also avoids a division-by-zero error) and taking the reciprocal:

>>> 1/(1+sqrt(pow(4.5-4,2)+pow(1-2,2)))
0.4721359549995794

This new function always returns a value between 0 and 1, where 1 means the two people have identical preferences. We can put this together to build our similarity function. Add the following code to recommendation.py:

from math import sqrt
 
# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
    # Get the list of shared_items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1

    # If they have no ratings in common, return 0
    if len(si) == 0: return 0

    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                          for item in si])

    return 1/(1 + sqrt(sum_of_squares))
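
A quick check in the interpreter: with this sqrt-based variant, Lisa Rose and Gene Seymour should come out at roughly 0.294 (the book's variant, which omits the square root, prints about 0.148 instead):

>>> import recommendation
>>> recommendation.sim_distance(recommendation.critics, 'Lisa Rose', 'Gene Seymour')
0.29429805508554946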

Pearson correlation score

The correlation coefficient is a measure of how well two sets of data fit onto a straight line. Its formula is more complicated than the Euclidean distance score, but it tends to give better results in situations where the data isn't well normalized (for example, when critics' ratings routinely deviate strongly from the average).

To visualize this method, we can plot the ratings of two critics on a chart, as in the figure below. Mick LaSalle gave Superman a 3 and Gene Seymour gave it a 5, so that movie is placed at (3,5).

[Figure: scatter plot of two critics' ratings with the best-fit line]

One notable aspect of the Pearson score, visible in the chart, is that it corrects for "grade inflation". Although Jack Matthews consistently gives higher scores than Lisa Rose, the line still fits well, because the two of them have relatively similar preferences. If one person is always inclined to give higher scores than another, and the difference between their scores stays consistent, they can still be strongly correlated. The Euclidean distance score described earlier would conclude that two critics are dissimilar because one is consistently "harsher" (and therefore always rates lower) than the other, even if their tastes are very similar. Whether or not this behavior is what you want depends on the application.

The Pearson correlation algorithm first finds the items rated by both critics, then computes the sums and the sums of the squares of the two critics' ratings, as well as the sum of the products of their ratings.
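
Concretely, over the n co-rated items the function below computes the standard product-moment correlation (num and den in the code are the numerator and denominator of this expression):

r = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\left(\sum x^{2} - \frac{(\sum x)^{2}}{n}\right)\left(\sum y^{2} - \frac{(\sum y)^{2}}{n}\right)}}

where x and y run over the two critics' ratings of the shared items.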

# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs, p1, p2):
    # Get the list of items rated by both
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item] = 1

    # Find the number of shared elements
    n = len(si)

    # If they have no ratings in common, return 0
    if n == 0: return 0

    # Add up all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    # Sum up the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])

    # Sum up the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])

    # Calculate the Pearson score
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0: return 0

    r = num / den

    return r

This function returns a value between -1 and 1. A value of 1 means the two people have identical ratings for every item.

import recommendation
print(recommendation.sim_pearson(recommendation.critics,'Lisa Rose','Gene Seymour') )
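
Assuming the dataset above, this prints the correlation between Lisa Rose and Gene Seymour, which should be approximately:

0.396059017191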

Ranking the critics

Now that we have functions for comparing two people, we can score everyone against a given person and find the closest matches.

# Returns the best matches for person from the prefs dictionary
# The number of results and the similarity function are optional parameters
def topMatches(prefs, person, n = 5, similarity = sim_pearson):
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]  # a list comprehension
    # Sort the list so the highest scores appear at the top
    scores.sort()
    scores.reverse()
    return scores[0:n]

Calling this method with your own name gives a list of critics and their similarity scores:

>>> import recommendation
>>> recommendation.topMatches(recommendation.critics,'Toby',n=3)
[(0.9912407071619299, 'Lisa Rose'), (0.9244734516419049, 'Mick LaSalle'), (0.8934051474415647, 'Claudia Puig')]

From these results we learn that we should be reading reviews by Lisa Rose, since her tastes are the most similar to ours.

Recommending items

Finding a critic with similar tastes and reading their reviews is nice, but what we really want is a movie recommendation.

To deal with problems such as cold start and outliers, we score the movies using a weighted value that ranks the critics: we take every other critic's ratings, obtain each critic's similarity to us, and multiply it by the rating they gave each movie.

In the book's accompanying table, the columns headed S.x show the similarities multiplied by the ratings; this way, people who are similar to us contribute more to the overall score than people who are unlike us. The Total row gives the sum of all the weighted scores.

A movie reviewed by more people would, however, have a larger influence on the result. To correct for this, we divide by the row called Sim.Sum, the sum of the similarities of all the critics who reviewed that movie.
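
Restated as a formula, the predicted score that the function below computes for an unseen movie m is

\text{predicted}(m) = \frac{\sum_{c} \text{sim}(p, c)\, r_{c,m}}{\sum_{c} \text{sim}(p, c)}

summing over the critics c with positive similarity to person p who rated m. For Toby and The Night Listener this works out to roughly 3.35, the first entry of the session output further below.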

# Gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs, person, similarity = sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        # Don't compare me to myself
        if other == person: continue
        sim = similarity(prefs, person, other)
        # Ignore scores of zero or lower
        if sim <= 0: continue
        for item in prefs[other]:
            # Only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * score
                totals.setdefault(item, 0)
                # dict.setdefault(key, default=None) adds the key with the given
                # default value only if the key is not already in the dictionary
                totals[item] += prefs[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalized list
    rankings = [(total / simSums[item], item) for item, total in totals.items()]  # a list comprehension
    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings

This code loops through every other person in the prefs dictionary. For each of them, it computes how similar they are to the person specified by the person parameter, and then loops through every item that person has rated.

Each rating is multiplied by the similarity, and these products are accumulated. At the end, every total is divided by the sum of the similarities to normalize the scores, and a sorted result is returned.

>>> import recommendation
>>> recommendation.getRecommendations(recommendation.critics,'Toby')
[(3.3477895267131017, 'The Night Listener'), (2.8325499182641614, 'Lady in the Water'), (2.530980703765565, 'Just My Luck')]

We now have a complete recommendation system, and it will work with any type of product or link. All we have to do is build a dictionary of people, items, and scores, and we can then use it to make recommendations for anyone.

Matching products

We simply transpose the matrix used earlier, turning the person-to-item ratings into item-to-person ratings.
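
For instance, the transposition turns

{'Lisa Rose': {'Lady in the Water': 2.5, 'Snake on a Plane': 3.5},
 'Gene Seymour': {'Lady in the Water': 3.0, 'Snake on a Plane': 3.5}}

into

{'Lady in the Water': {'Lisa Rose': 2.5, 'Gene Seymour': 3.0},
 'Snake on a Plane': {'Lisa Rose': 3.5, 'Gene Seymour': 3.5}}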

# This function simply swaps the people and the items in the dictionary
def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            # Flip item and person
            result[item][person] = prefs[person][item]
    return result

Now we can call the topMatches function used earlier to find the set of movies most similar to a given movie:

>>> import recommendation
>>> movies=recommendation.transformPrefs(recommendation.critics)
>>> recommendation.topMatches(movies,'Snake on a Plane')
[(0.7637626158259785, 'Lady in the Water'), (0.11180339887498941, 'Superman Returns'), (-0.3333333333333333, 'Just My Luck'), (-0.5663521139548527, 'The Night Listener'), (-0.6454972243679047, 'You, Me and Dupree')]

In this example, some of the correlation scores are actually negative, which indicates that those who like one of the movies tend to dislike the other.

Above we produced recommendations related to a movie. Beyond that, we can even recommend critics for a movie; for example, maybe we are trying to decide whom to invite to a premiere.

>>> recommendation.getRecommendations(movies,'Just My Luck')
[(4.0, 'Michael Phillips'), (3.0, 'Jack Matthews')]

Building a del.icio.us Link Recommender

Since this API is no longer reachable from here and the pydelicious module cannot be imported, and there seems to be no good workaround, this part has to be skipped. The module source is reproduced below for reference.

pydelicious.py

"""Library to access del.icio.us data via Python.

:examples:

  Using the API class directly:

  >>> a = pydelicious.apiNew('user', 'passwd')
  >>> # or:
  >>> a = DeliciousAPI('user', 'passwd')
  >>> a.tags_get() # Same as:
  >>> a.request('tags/get', )

  Or by calling the 'convenience' methods on the module.

  - def add(user, passwd, url, description, tags = "", extended = "", dt = "", replace="no"):
  - def get(user, passwd, tag="", dt="",  count = 0):
  - def get_all(user, passwd, tag = ""):
  - def delete(user, passwd, url):
  - def rename_tag(user, passwd, oldtag, newtag):
  - def get_tags(user, passwd):

  >>> a = apiNew(user, passwd)
  >>> a.posts_add(url="http://my.com/", description="my.com", extended="the url is my.moc", tags="my com")
  True
  >>> len(a.posts_all())
  1
  >>> get_all(user, passwd)
  1

  These are short functions for getrss calls.

  >>> rss_

def get_userposts(user):
def get_tagposts(tag):
def get_urlposts(url):
def get_popular(tag = ""):

  >>> json_posts()
  >>> json_tags()
  >>> json_network()
  >>> json_fans()

:License: pydelicious is released under the BSD license. See 'license.txt'
 for more information.

:berend:
 - Rewriting comments to english. More documentation, examples.
 - Added JSON-like return values for XML data (del.icio.us also serves some JSON...)
 - better error/exception classes and handling, work in progress.
 - Encoding seems to be working (using UTF-8 here).

:@todo:
 - Source code SHOULD BE ASCII!
 - More tests.
 - Parse datetimes in XML.
 - Salvage and test RSS functionality?
 - Setup not used, Still works? Should setup.py be tested?
 - API functions need required argument checks.

 * include the license and also distribute it via setup.py
 * write a readme too and distribute it via setup.py
 * test on other systems as well (linux -> uni)
 * have releases built automatically, named correctly, and moved
   into the right directory.
 * what else can the other libraries do? (ruby, java, perl, etc)
 * what do the people who use it want?
 * what could I use it for?
 * slim it down?

:done:
 * Refactored the API class, much cleaner now and functions dlcs_api_request, dlcs_parse_xml are available for who wants them.
 * is this right? should rather still convert t[a]g with str2utf8
   >>> pydelicious.getrss(tag="t[a]g")
   url: http://del.icio.us/rss/tag/t[a]g
 * requester must wait a second between calls
 * __init__.py re-exports the functions
 * html parser does not work yet, not at all
 * old functions are missing, get_posts_by_url, etc.
 * create a post function that also adds the missing attribs.
 * the api still needs more work
 * requester must catch the 503 errors
 * rss parser must be adapted to the many possible feed layouts
"""
import sys
import os
import time
import datetime
import md5, httplib
import urllib, urllib2, time
from StringIO import StringIO

try:
    from elementtree.ElementTree import parse as parse_xml
except ImportError:
    from  xml.etree.ElementTree import parse as parse_xml

import feedparser


### Static config

__version__ = '0.5.0'
__author__ = 'Frank Timmermann <regenkind_at_gmx_dot_de>' # GP: does not respond to emails
__contributors__ = [
    'Greg Pinero',
    'Berend van Berkum <[email protected]>']
__url__ = 'http://code.google.com/p/pydelicious/'
__author_email__ = ""
# Old URL: 'http://deliciouspython.python-hosting.com/'

__description__ = '''pydelicious.py allows you to access the web service of del.icio.us via it's API through python.'''
__long_description__ = '''the goal is to design an easy to use and fully functional python interface to del.icio.us. '''

DLCS_OK_MESSAGES = ('done', 'ok') # Known text values of positive del.icio.us <result> answers
DLCS_WAIT_TIME = 4
DLCS_REQUEST_TIMEOUT = 444 # Seconds before socket triggers timeout
#DLCS_API_REALM = 'del.icio.us API'
DLCS_API_HOST = 'https://api.del.icio.us'
DLCS_API_PATH = 'v1'
DLCS_API = "%s/%s" % (DLCS_API_HOST, DLCS_API_PATH)
DLCS_RSS = 'http://del.icio.us/rss/'

ISO_8601_DATETIME = '%Y-%m-%dT%H:%M:%SZ'

USER_AGENT = 'pydelicious.py/%s %s' % (__version__, __url__)

DEBUG = 0
if 'DLCS_DEBUG' in os.environ:
    DEBUG = int(os.environ['DLCS_DEBUG'])


# Taken from FeedParser.py
# timeoutsocket allows feedparser to time out rather than hang forever on ultra-slow servers.
# Python 2.3 now has this functionality available in the standard socket library, so under
# 2.3 you don't need to install anything.  But you probably should anyway, because the socket
# module is buggy and timeoutsocket is better.
try:
    import timeoutsocket # http://www.timo-tasi.org/python/timeoutsocket.py
    timeoutsocket.setDefaultSocketTimeout(DLCS_REQUEST_TIMEOUT)
except ImportError:
    import socket
    if hasattr(socket, 'setdefaulttimeout'): socket.setdefaulttimeout(DLCS_REQUEST_TIMEOUT)
if DEBUG: print >>sys.stderr, "Set socket timeout to %s seconds" % DLCS_REQUEST_TIMEOUT


### Utility classes

class _Waiter:
    """Waiter makes sure a certain amount of time passes between
    successive calls of `Waiter()`.

    Some attributes:
    :last: time of last call
    :wait: the minimum time needed between calls
    :waited: the number of calls throttled

    pydelicious.Waiter is an instance created when the module is loaded.
    """
    def __init__(self, wait):
        self.wait = wait
        self.waited = 0
        self.lastcall = 0

    def __call__(self):
        tt = time.time()

        timeago = tt - self.lastcall

        if self.lastcall and DEBUG>2:
            print >>sys.stderr, "Lastcall: %s seconds ago." % timeago

        if timeago <= self.wait:
            if DEBUG>0: print >>sys.stderr, "Waiting %s seconds." % self.wait
            time.sleep(self.wait)
            self.waited += 1
            self.lastcall = tt + self.wait
        else:
            self.lastcall = tt

Waiter = _Waiter(DLCS_WAIT_TIME)

class PyDeliciousException(Exception):
    '''Std. pydelicious error'''
    pass

class DeliciousError(Exception):
    """Raised when the server responds with a negative answer"""


class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):
    '''@xxx:bvb: Where is this used? should it be registered somewhere with urllib2?

    Handles HTTP Error, currently only 503.
    '''
    def http_error_503(self, req, fp, code, msg, headers):
        # the original referenced an undefined 'throttled_message'; a literal is used here
        raise urllib2.HTTPError(req, code, 'del.icio.us: request throttled (503)', headers, fp)


class post(dict):
    """Post object, contains href, description, hash, dt, tags,
    extended, user, count(, shared).

    @xxx:bvb: Is this needed? Right now this is superfluous,
    """
    def __init__(self, href = "", description = "", hash = "", time = "", tag = "", extended = "", user = "", count = "",
                 tags = "", url = "", dt = ""): # tags or tag?
        self["href"] = href
        if url != "": self["href"] = url
        self["description"] = description
        self["hash"] = hash
        self["dt"] = dt
        if time != "": self["dt"] = time
        self["tags"] = tags
        if tag != "":  self["tags"] = tag     # tag or tags? # !! tags
        self["extended"] = extended
        self["user"] = user
        self["count"] = count

    def __getattr__(self, name):
        try: return self[name]
        except KeyError: return object.__getattribute__(self, name)


class posts(list):
    """@xxx:bvb: idem as class post, python structures (dict/list) might
    suffice or a more generic solution is needed.
    """
    def __init__(self, *args):
        for i in args: self.append(i)

    def __getattr__(self, attr):
        try: return [p[attr] for p in self]
        except KeyError: return object.__getattribute__(self, attr)

### Utility functions

def str2uni(s):
    # type(in) str or unicode
    # type(out) unicode
    return ("".join([unichr(ord(i)) for i in s]))

def str2utf8(s):
    # type(in) str or unicode
    # type(out) str
    return ("".join([unichr(ord(i)).encode("utf-8") for i in s]))

def str2quote(s):
    return urllib.quote_plus("".join([unichr(ord(i)).encode("utf-8") for i in s]))

def dict0(d):
    # Trims empty dict entries
    # {'a':'a', 'b':'', 'c': 'c'} => {'a': 'a', 'c': 'c'}
    dd = dict()
    for i in d:
            if d[i] != "": dd[i] = d[i]
    return dd

def delicious_datetime(str):
    """Parse an ISO 8601 formatted string to a Python datetime ...
    """
    return datetime.datetime(*time.strptime(str, ISO_8601_DATETIME)[0:6])

def http_request(url, user_agent=USER_AGENT, retry=4):
    """Retrieve the contents referenced by the URL using urllib2.

    Retries up to four times (default) on exceptions.
    """
    request = urllib2.Request(url, headers={'User-Agent':user_agent})

    # Remember last error
    e = None

    # Repeat request on time-out errors
    tries = retry
    while tries:
        try:
            return urllib2.urlopen(request)

        except urllib2.HTTPError, e: # protocol errors,
            raise PyDeliciousException, "%s" % e

        except urllib2.URLError, e:
            # @xxx: Ugly check for time-out errors
            #if len(e)>0 and 'timed out' in arg[0]:
            print >> sys.stderr, "%s, %s tries left." % (e, tries)
            Waiter()
            tries = tries - 1
            #else:
            #    tries = None

    # Give up
    raise PyDeliciousException, \
            "Unable to retrieve data at '%s', %s" % (url, e)

def http_auth_request(url, host, user, passwd, user_agent=USER_AGENT):
    """Call an HTTP server with authorization credentials using urllib2.
    """
    if DEBUG: httplib.HTTPConnection.debuglevel = 1

    # Hook up handler/opener to urllib2
    password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
    password_manager.add_password(None, host, user, passwd)
    auth_handler = urllib2.HTTPBasicAuthHandler(password_manager)
    opener = urllib2.build_opener(auth_handler)
    urllib2.install_opener(opener)

    return http_request(url, user_agent)

def dlcs_api_request(path, params='', user='', passwd='', throttle=True):
    """Retrieve/query a path within the del.icio.us API.

    This implements a minimum interval between calls to avoid
    throttling. [#]_ Use param 'throttle' to turn this behaviour off.

    @todo: back off on 503's (HTTPError, URLError? @todo: testing).

    Returned XML does not always correspond with given del.icio.us examples
    @todo: (cf. help/api/... and post's attributes)

    .. [#] http://del.icio.us/help/api/
    """
    if throttle:
        Waiter()

    if params:
        # params come as a dict, strip empty entries and urlencode
        url = "%s/%s?%s" % (DLCS_API, path, urllib.urlencode(dict0(params)))
    else:
        url = "%s/%s" % (DLCS_API, path)

    if DEBUG: print >>sys.stderr, "dlcs_api_request: %s" % url

    try:
        return http_auth_request(url, DLCS_API_HOST, user, passwd, USER_AGENT)

    # @bvb: Is this ever raised? When?
    except DefaultErrorHandler, e:
        print >>sys.stderr, "%s" % e

def dlcs_parse_xml(data, split_tags=False):
    """Parse any del.icio.us XML document and return Python data structure.

    Recognizes all XML document formats as returned by the version 1 API and
    translates to a JSON-like data structure (dicts 'n lists).

    Returned instance is always a dictionary. Examples::

     {'posts': [{'url':'...','hash':'...',},],}
     {'tags':['tag1', 'tag2',]}
     {'dates': [{'count':'...','date':'...'},], 'tag':'', 'user':'...'}
     {'result':(True, "done")}
     # etcetera.
    """

    if DEBUG>3: print >>sys.stderr, "dlcs_parse_xml: parsing from ", data

    if not hasattr(data, 'read'):
        data = StringIO(data)

    doc = parse_xml(data)
    root = doc.getroot()
    fmt = root.tag

    # Split up into three cases: Data, Result or Update
    if fmt in ('tags', 'posts', 'dates', 'bundles'):

        # Data: expect a list of data elements, 'resources'.
        # Use `fmt` (without last 's') to find data elements, elements
        # don't have contents, attributes contain all the data we need:
        # append to list
        elist = [el.attrib for el in doc.findall(fmt[:-1])]

        # Return list in dict, use tagname of rootnode as keyname.
        data = {fmt: elist}

        # Root element might have attributes too, append dict.
        data.update(root.attrib)

        return data

    elif fmt == 'result':

        # Result: answer to operations
        if root.attrib.has_key('code'):
            msg = root.attrib['code']
        else:
            msg = root.text

        # Return {'result':(True, msg)} for /known/ O.K. messages,
        # use (False, msg) otherwise
        v = msg in DLCS_OK_MESSAGES
        return {fmt: (v, msg)}

    elif fmt == 'update':

        # Update: "time"
        #return {fmt: root.attrib}
        return {fmt: {'time':time.strptime(root.attrib['time'], ISO_8601_DATETIME)}}

    else:
        raise PyDeliciousException, "Unknown XML document format '%s'" % fmt

def dlcs_rss_request(tag = "", popular = 0, user = "", url = ''):
    """Handle a request for RSS

    rss should work again now, but this try/except mess is not pretty

    rss feeds are assembled in different ways; I cannot yet see a consistent
    relationship between the data (url, desc, ext, etc.) and the feed.
    why can't they make this uniform?
    """
    tag = str2quote(tag)
    user = str2quote(user)
    if url != '':
        # http://del.icio.us/rss/url/efbfb246d886393d48065551434dab54
        url = DLCS_RSS + '''url/%s'''%md5.new(url).hexdigest()
    elif user != '' and tag != '':
        url = DLCS_RSS + '''%(user)s/%(tag)s'''%dict(user=user, tag=tag)
    elif user != '' and tag == '':
        # http://del.icio.us/rss/delpy
        url = DLCS_RSS + '''%s'''%user
    elif popular == 0 and tag == '':
        url = DLCS_RSS
    elif popular == 0 and tag != '':
        # http://del.icio.us/rss/tag/apple
        # http://del.icio.us/rss/tag/web2.0
        url = DLCS_RSS + "tag/%s"%tag
    elif popular == 1 and tag == '':
        url = DLCS_RSS + '''popular/'''
    elif popular == 1 and tag != '':
        url = DLCS_RSS + '''popular/%s'''%tag
    rss = http_request(url).read()
    rss = feedparser.parse(rss)
    # print rss
#     for e in rss.entries: print e;print
    l = posts()
    for e in rss.entries:
        if e.has_key("links") and e["links"]!=[] and e["links"][0].has_key("href"):
            url = e["links"][0]["href"]
        elif e.has_key("link"):
            url = e["link"]
        elif e.has_key("id"):
            url = e["id"]
        else:
            url = ""
        if e.has_key("title"):
            description = e['title']
        elif e.has_key("title_detail") and e["title_detail"].has_key("title"):
            description = e["title_detail"]['value']
        else:
            description = ''
        try: tags = e['categories'][0][1]
        except:
            try: tags = e["category"]
            except: tags = ""
        if e.has_key("modified"):
            dt = e['modified']
        else:
            dt = ""
        if e.has_key("summary"):
            extended = e['summary']
        elif e.has_key("summary_detail"):
            extended = e['summary_detail']["value"]
        else:
            extended = ""
        if e.has_key("author"):
            user = e['author']
        else:
            user = ""
#  time = dt points at a problem:
#  the variable naming is not consistent
#  sending to the api and
#  receiving xml are two different pairs of shoes :(
        l.append(post(url = url, description = description, tags = tags, dt = dt, extended = extended, user = user))
    return l


### Main module class

class DeliciousAPI:
    """Class providing main interface to del.icio.us API.

    Methods ``request`` and ``request_raw`` represent the core. For all API
    paths there are furthermore methods (e.g. posts_add for 'posts/all') with
    an explicit declaration of the parameters and documentation. These all call
    ``request`` and pass on extra keywords like ``_raw``.
    """

    def __init__(self, user, passwd, codec='iso-8859-1', api_request=dlcs_api_request, xml_parser=dlcs_parse_xml):
        """Initialize access to the API with ``user`` and ``passwd``.

        ``codec`` sets the encoding of the arguments.

        The ``api_request`` and ``xml_parser`` parameters by default point to
        functions within this package with standard implementations to
        request and parse a resource. See ``dlcs_api_request()`` and
        ``dlcs_parse_xml()``. Note that ``api_request`` should return a
        file-like instance with an HTTPMessage instance under ``info()``,
        see ``urllib2.openurl`` for more info.
        """
        assert user != ""
        self.user = user
        self.passwd = passwd
        self.codec = codec

        # Implement communication to server and parsing of response messages:
        assert callable(api_request)
        self._api_request = api_request
        assert callable(xml_parser)
        self._parse_response = xml_parser

    def _call_server(self, path, **params):
        params = dict0(params)
        for key in params:
            params[key] = params[key].encode(self.codec)

        # see __init__ for _api_request()
        return self._api_request(path, params, self.user, self.passwd)


    ### Core functionality

    def request(self, path, _raw=False, **params):
        """Calls a path in the API, parses the answer to a JSON-like structure by
        default. Use with ``_raw=True`` or ``call request_raw()`` directly to
        get the filehandler and process the response message manually.

        Calls to some paths will return a `result` message, i.e.::

            <result code="..." />

        or::

            <result>...</result>

        These are all parsed to ``{'result':(Boolean, MessageString)}`` and this
        method will raise ``DeliciousError`` on negative `result` answers. Using
        ``_raw=True`` bypasses all parsing and will never raise ``DeliciousError``.

        See ``dlcs_parse_xml()`` and ``self.request_raw()``."""

        # method _parse_response is bound in `__init__()`, `_call_server`
        # uses `_api_request` also set in `__init__()`
        if _raw:
            # return answer
            return self.request_raw(path, **params)

        else:
            # get answer and parse
            fl = self._call_server(path, **params)
            rs = self._parse_response(fl)

            # Raise an error for negative 'result' answers
            if type(rs) == dict and 'result' in rs and not rs['result'][0]:
                errmsg = ""
                if len(rs['result']) > 1:
                    errmsg = rs['result'][1]
                raise DeliciousError, errmsg

            return rs

    def request_raw(self, path, **params):
        """Calls the path in the API, returns the filehandle. Returned
        file-like instances have an ``HTTPMessage`` instance with HTTP header
        information available. Use ``filehandle.info()`` or refer to the
        ``urllib2.openurl`` documentation.
        """
        # see `request()` on how the response can be handled
        return self._call_server(path, **params)

    ### Explicit declarations of API paths, their parameters and docs

    # Tags
    def tags_get(self, **kwds):
        """Returns a list of tags and the number of times it is used by the user.
        ::

            <tags>
                <tag tag="TagName" count="888">
        """
        return self.request("tags/get", **kwds)

    def tags_rename(self, old, new, **kwds):
        """Rename an existing tag with a new tag name. Returns a `result`
        message or raises an ``DeliciousError``. See ``self.request()``.

        &old (required)
            Tag to rename.
        &new (required)
            New name.
        """
        return self.request("tags/rename", old=old, new=new, **kwds)

    # Posts
    def posts_update(self, **kwds):
        """Returns the last update time for the user. Use this before calling
        `posts_all` to see if the data has changed since the last fetch.
        ::

            <update time="CCYY-MM-DDThh:mm:ssZ">
		"""
        return self.request("posts/update", **kwds)

    def posts_dates(self, tag="", **kwds):
        """Returns a list of dates with the number of posts at each date.
        ::

            <dates>
                <date date="CCYY-MM-DD" count="888">

        &tag (optional).
            Filter by this tag.
        """
        return self.request("posts/dates", tag=tag, **kwds)

    def posts_get(self, tag="", dt="", url="", **kwds):
        """Returns posts matching the arguments. If no date or url is given,
        most recent date will be used.
        ::

            <posts dt="CCYY-MM-DD" tag="..." user="...">
                <post ...>

        &tag (optional).
            Filter by this tag.
        &dt (optional).
            Filter by this date (CCYY-MM-DDThh:mm:ssZ).
        &url (optional).
            Filter by this url.
        """
        return self.request("posts/get", tag=tag, dt=dt, url=url, **kwds)

    def posts_recent(self, tag="", count="", **kwds):
        """Returns a list of the most recent posts, filtered by argument.
        ::

            <posts tag="..." user="...">
                <post ...>

        &tag (optional).
            Filter by this tag.
        &count (optional).
            Number of items to retrieve (Default:15, Maximum:100).
        """
        return self.request("posts/recent", tag=tag, count=count, **kwds)

    def posts_all(self, tag="", **kwds):
        """Returns all posts. Please use sparingly. Call the `posts_update`
        method to see if you need to fetch this at all.
        ::

            <posts tag="..." user="..." update="CCYY-MM-DDThh:mm:ssZ">
                <post ...>

        &tag (optional).
            Filter by this tag.
        """
        return self.request("posts/all", tag=tag, **kwds)

    def posts_add(self, url, description, extended="", tags="", dt="",
            replace="no", shared="yes", **kwds):
        """Add a post to del.icio.us. Returns a `result` message or raises an
        ``DeliciousError``. See ``self.request()``.

        &url (required)
            the url of the item.
        &description (required)
            the description of the item.
        &extended (optional)
            notes for the item.
        &tags (optional)
            tags for the item (space delimited).
        &dt (optional)
            datestamp of the item (format "CCYY-MM-DDThh:mm:ssZ").

        Requires a LITERAL "T" and "Z" like in ISO8601 at http://www.cl.cam.ac.uk/~mgk25/iso-time.html for example: "1984-09-01T14:21:31Z"
        &replace=no (optional) - don't replace post if given url has already been posted.
        &shared=no (optional) - make the item private
        """
        return self.request("posts/add", url=url, description=description,
                extended=extended, tags=tags, dt=dt,
                replace=replace, shared=shared, **kwds)

    def posts_delete(self, url, **kwds):
        """Delete a post from del.icio.us. Returns a `result` message or
        raises an ``DeliciousError``. See ``self.request()``.

        &url (required)
            the url of the item.
        """
        return self.request("posts/delete", url=url, **kwds)

    # Bundles
    def bundles_all(self, **kwds):
        """Retrieve user bundles from del.icio.us.
        ::

            <bundles>
                <bundle name="..." tags="...">
        """
        return self.request("tags/bundles/all", **kwds)

    def bundles_set(self, bundle, tags, **kwds):
        """Assign a set of tags to a single bundle, wipes away previous
        settings for bundle. Returns a `result` messages or raises an
        ``DeliciousError``. See ``self.request()``.

        &bundle (required)
            the bundle name.
        &tags (required)
            list of tags (space separated).
        """
        if type(tags)==list:
            tags = " ".join(tags)
        return self.request("tags/bundles/set", bundle=bundle, tags=tags,
                **kwds)

    def bundles_delete(self, bundle, **kwds):
        """Delete a bundle from del.icio.us. Returns a `result` message or
        raises an ``DeliciousError``. See ``self.request()``.

        &bundle (required)
            the bundle name.
        """
        return self.request("tags/bundles/delete", bundle=bundle, **kwds)

    ### Utils

    # Lookup table for del.icio.us url-path to DeliciousAPI method.
    paths = {
        'tags/get': tags_get,
        'tags/rename': tags_rename,
        'posts/update': posts_update,
        'posts/dates': posts_dates,
        'posts/get': posts_get,
        'posts/recent': posts_recent,
        'posts/all': posts_all,
        'posts/add': posts_add,
        'posts/delete': posts_delete,
        'tags/bundles/all': bundles_all,
        'tags/bundles/set': bundles_set,
        'tags/bundles/delete': bundles_delete,
    }

    def get_url(self, url):
        """Return the del.icio.us url at which the HTML page with posts for
        ``url`` can be found.
        """
        return "http://del.icio.us/url/?url=%s" % (url,)


### Convenience functions on this package

def apiNew(user, passwd):
    """creates a new DeliciousAPI object.
    requires user(name) and passwd
	"""
    return DeliciousAPI(user=user, passwd=passwd)

def add(user, passwd, url, description, tags="", extended="", dt="", replace="no"):
    return apiNew(user, passwd).posts_add(url=url, description=description, extended=extended, tags=tags, dt=dt, replace=replace)

def get(user, passwd, tag="", dt="",  count = 0):
    posts = apiNew(user, passwd).posts_get(tag=tag,dt=dt)
    if count != 0: posts = posts[0:count]
    return posts

def get_all(user, passwd, tag=""):
    return apiNew(user, passwd).posts_all(tag=tag)

def delete(user, passwd, url):
    return apiNew(user, passwd).posts_delete(url=url)

def rename_tag(user, passwd, oldtag, newtag):
    return apiNew(user=user, passwd=passwd).tags_rename(old=oldtag, new=newtag)

def get_tags(user, passwd):
    return apiNew(user=user, passwd=passwd).tags_get()


### RSS functions @bvb: still working...?
def getrss(tag="", popular=0, url='', user=""):
    """get posts from del.icio.us via parsing RSS @bvb[or HTML]

    @bvb[not tested]

    tag (opt) sort by tag
    popular (opt) look for the popular stuff
    user (opt) get the posts by a user, this overrides popular
    url (opt) get the posts by url
    """
    return dlcs_rss_request(tag=tag, popular=popular, user=user, url=url)

def get_userposts(user):
    return getrss(user = user)

def get_tagposts(tag):
    return getrss(tag = tag)

def get_urlposts(url):
    return getrss(url = url)

def get_popular(tag = ""):
    return getrss(tag = tag, popular = 1)


### @TODO: implement JSON fetching
def json_posts(user, count=15):
    """http://del.icio.us/feeds/json/mpe
    http://del.icio.us/feeds/json/mpe/art+history
    count=###   the number of posts you want to get (default is 15, maximum is 100)
    raw         a raw JSON object is returned, instead of an object named Delicious.posts
    """

def json_tags(user, atleast, count, sort='alpha'):
    """http://del.icio.us/feeds/json/tags/mpe
    atleast=###         include only tags for which there are at least ### number of posts
    count=###           include ### tags, counting down from the top
    sort={alpha|count}  construct the object with tags in alphabetic order (alpha), or by count of posts (count)
    callback=NAME       wrap the object definition in a function call NAME(...), thus invoking that function when the feed is executed
    raw                 a pure JSON object is returned, instead of code that will construct an object named Delicious.tags
    """

def json_network(user):
    """http://del.icio.us/feeds/json/network/mpe
    callback=NAME       wrap the object definition in a function call NAME(...)
    ?raw         a raw JSON object is returned, instead of an object named Delicious.posts
    """

def json_fans(user):
    """http://del.icio.us/feeds/json/fans/mpe
    callback=NAME       wrap the object definition in a function call NAME(...)
    ?raw         a pure JSON object is returned, instead of an object named Delicious.
    """


deliciousrec.py

from pydelicious import get_popular,get_userposts,get_urlposts
import time

def initializeUserDict(tag,count=5):
  user_dict={}
  # get the top 'count' popular posts
  for p1 in get_popular(tag=tag)[0:count]:
    # find all users who posted this
    for p2 in get_urlposts(p1['href']):
      user=p2['user']
      user_dict[user]={}
  return user_dict

def fillItems(user_dict):
  all_items={}
  # Find links posted by all users
  for user in user_dict:
    posts=[]  # start empty so a user whose fetches all fail is simply skipped
    for i in range(3):
      try:
        posts=get_userposts(user)
        break
      except:
        print("Failed user "+user+", retrying")
        time.sleep(4)
    for post in posts:
      url=post['href']
      user_dict[user][url]=1.0
      all_items[url]=1

  # Fill in missing items with 0
  for ratings in user_dict.values():
    for item in all_items:
      if item not in ratings:
        ratings[item]=0.0
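
If the del.icio.us API were still reachable, building the dataset would look something like this (the tag is just an example; this can no longer be tested):

>>> from deliciousrec import initializeUserDict, fillItems
>>> delusers = initializeUserDict('programming')
>>> fillItems(delusers)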

Recommending neighbors and links

To pick a user at random and find other users whose tastes are similar to theirs, we can call topMatches.

We can also get recommended links for that user by calling the getRecommendations function. Since the call returns every item in ranked order, it is best to limit it to the top 10. Finally, the entries in the preference list can be swapped, so that we can search by link rather than by person; a sketch of all three steps follows.
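
A minimal sketch, assuming delusers was built as above (again untestable now that the API is gone):

import random
import recommendation

# pick a user at random
user = list(delusers.keys())[random.randint(0, len(delusers) - 1)]
# users with similar tastes
print(recommendation.topMatches(delusers, user))
# the top 10 recommended links for that user
print(recommendation.getRecommendations(delusers, user)[0:10])
# transpose the preferences to search by link rather than by person
url = recommendation.getRecommendations(delusers, user)[0][1]
print(recommendation.topMatches(recommendation.transformPrefs(delusers), url))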

Item-Based Filtering

The general technique is to precompute the most similar items for each item. Then, when we want to make recommendations to a user, we look at his top-rated items and build a weighted list of the items most similar to those. Comparisons between items do not change as frequently as comparisons between users, so we do not have to continually recompute each item's most similar items; we can schedule that computation for times of low network traffic, or run it on a computer separate from the main application.

Building the item comparison dataset

def calculateSimilarItems(prefs, n = 10):
    # Create a dictionary of items showing which other items each is most similar to
    result = {}
    # Invert the preference matrix to be item-centric
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        # Status updates for large datasets
        c += 1
        if c % 100 == 0:print ("%d / %d") % (c, len(itemPrefs))
        # Find the items most similar to this one
        scores = topMatches(itemPrefs, item, n = n, similarity = sim_distance)
        result[item] = scores
    return result # a dictionary of items with their lists of most similar items

This function first uses the transformPrefs function defined earlier to invert the ratings dictionary, producing a list of items along with the way they were rated by each user. It then loops over every item and passes the transformed dictionary to topMatches to get the most similar items with their similarity scores. Finally, it builds and returns a dictionary of items together with a list of their most similar items.

>>> import recommendation
>>> itemsim=recommendation.calculateSimilarItems(recommendation.critics)
>>> itemsim
{'Lady in the Water': [(0.4494897427831781, 'You, Me and Dupree'), (0.38742588672279304, 'The Night Listener'), (0.3483314773547883, 'Snake on a Plane'), (0.3483314773547883, 'Just My Luck'), (0.2402530733520421, 'Superman Returns')], 'Snake on a Plane': [(0.3483314773547883, 'Lady in the Water'), (0.32037724101704074, 'The Night Listener'), (0.3090169943749474, 'Superman Returns'), (0.2553967929896867, 'Just My Luck'), (0.1886378647726465, 'You, Me and Dupree')], 'Just My Luck': [(0.3483314773547883, 'Lady in the Water'), (0.32037724101704074, 'You, Me and Dupree'), (0.2989350844248255, 'The Night Listener'), (0.2553967929896867, 'Snake on a Plane'), (0.20799159651347807, 'Superman Returns')], 'Superman Returns': [(0.3090169943749474, 'Snake on a Plane'), (0.252650308587072, 'The Night Listener'), (0.2402530733520421, 'Lady in the Water'), (0.20799159651347807, 'Just My Luck'), (0.1918253663634734, 'You, Me and Dupree')], 'You, Me and Dupree': [(0.4494897427831781, 'Lady in the Water'), (0.32037724101704074, 'Just My Luck'), (0.29429805508554946, 'The Night Listener'), (0.1918253663634734, 'Superman Returns'), (0.1886378647726465, 'Snake on a Plane')], 'The Night Listener': [(0.38742588672279304, 'Lady in the Water'), (0.32037724101704074, 'Snake on a Plane'), (0.2989350844248255, 'Just My Luck'), (0.29429805508554946, 'You, Me and Dupree'), (0.252650308587072, 'Superman Returns')]}

This function only needs to be run often enough to keep the item similarities from going out of date. We need to run it more frequently early on, while the user base and the number of ratings are still small; as the number of users grows, the similarity scores between items usually become increasingly stable.

Getting recommendations

def getRecommendedItems(prefs, itemMatch, user):  # itemMatch is the item similarity matrix
    userRatings = prefs[user]
    scores = {}
    totalSim = {}
    # Loop over the items rated by this user
    for (item, rating) in userRatings.items():  # dict.items() returns the (key, value) pairs
        # Loop over the items similar to this one
        for (similarity, item2) in itemMatch[item]:
            # Ignore this item if the user has already rated it
            if item2 in userRatings: continue
            # Weighted sum of rating times similarity
            scores.setdefault(item2, 0)  # see the earlier note on setdefault
            scores[item2] += similarity * rating
            # Sum of all the similarities
            totalSim.setdefault(item2, 0)
            totalSim[item2] += similarity

    # Divide each total score by the total weighting to get an average
    rankings = [(score / totalSim[item], item) for item, score in scores.items()]
    # Return the rankings from highest to lowest
    rankings.sort()
    rankings.reverse()
    return rankings

Getting a new set of recommendations for Toby:

>>> import recommendation
>>> itemsim=recommendation.calculateSimilarItems(recommendation.critics)
>>> recommendation.getRecommendedItems(recommendation.critics,itemsim,'Toby')
[(3.1667425234070894, 'The Night Listener'), (2.9366294028444346, 'Just My Luck'), (2.868767392626467, 'Lady in the Water')]

Using the MovieLens Dataset

The dataset can be downloaded from the GroupLens website; note that the version used here is the small one (ml-latest-small).

The data format we need:

Every line of the ratings file contains a user ID, a movie ID, the rating the user gave the movie, and the time of the rating. We can get a movie's title from its ID, but since the user data is anonymous, in this section we can only work with user IDs.
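
For reference, the ml-latest-small files are plain CSV with a header row, along these lines (illustrative rows):

movies.csv:
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy

ratings.csv:
userId,movieId,rating,timestamp
1,1,4.0,964982703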

def loadMovieLens(path = '/data/ml-latest-small'):
    # Get movie titles
    movies = {}
    for line in open(path + '/movies.csv'):
        (movieId, title) = line.split('\t')[0:2]  # the file's third column is the genre, so this is slightly modified
        movies[movieId] = title  # map the id to the title
    # Load data
    prefs = {}
    for line in open(path + '/ratings.csv'):
        (user, movieid, rating, ts) = line.split('\t')  # split the line
        prefs.setdefault(user, {})
        prefs[user][movies[movieid]] = float(rating)
    return prefs

Since the csv files downloaded from the site are comma-delimited, but movie titles also contain commas, I preprocessed the data in Excel and converted it to tab-delimited form (reference link 1).

However, converting the rating values raised an error:
[Screenshot: traceback from the failed string-to-float conversion]
I then read some other posts on processing this data and changed the code as follows (reference article 1, reference article 2):

def loadMovieLens(path='./data/my-small'):
    import csv
    # Get movie titles
    movies = {}
    with open(path + '/movies.csv') as movies_file:
        row = csv.reader(movies_file,delimiter='\t')
        next(row)  # skip the header row
        id = []    # an array to hold the movie ids
        title =[]  # an array to hold the movie titles
        for r in row:
            id.append(r[0])
            title.append(r[1])
            movies[id] = title  # map the id to the title (bug: id and title are lists here, and lists are unhashable)
    # Load data
    prefs = {}
    with open(path + '/ratings.csv') as ratings_file:
        row = csv.reader(ratings_file,delimiter=',')
        next(row)  # skip the header row
        user = []     # an array to hold the user ids
        movieid = []  # an array to hold the movie ids
        rating = []   # an array to hold the ratings
        ts = []       # an array to hold the timestamps
        # read every row after the header and append its fields to the arrays
        for r in row:
            user.append(r[0])
            movieid.append(r[1])
            rating.append(float(r[2]))
            ts.append(r[3])
            prefs.setdefault(user, {})
            #prefs[user][movies[movieid]] = float(rating)
            prefs[user][movies[movieid]] = rating
    return prefs

But it still errored; my shaky Python was showing.

For that error, see the article referenced above: the str-to-float conversion kept failing, so in the end I decided to just read the ratings column into a list first, and the program finally ran. Below is the final code for loading the movie data.

def loadMovieLens(path='./data/my-small'):
    import csv
    # Get movie titles
    movies = {}
    for line in open(path + '/movies.csv'):
        (id, title) = line.split('\t')[0:2]  # the file's third column is the genre, so this is slightly modified
        movies[id] = title  # map the id to the title
    # Load data
    prefs = {}
    with open('./data/my-small/ratings.csv') as csv_file:
        row = csv.reader(csv_file,delimiter=',')
        next(row)  # skip the header row
        ratings = []  # an array to hold the rating values
        # read the third column of every row after the header into ratings
        for r in row:
            ratings.append(float(r[2]))  # convert the string to a float and append it
    i = 0
    for line in open(path + '/ratings.csv'):
        if(i==len(ratings)): break
        (user, movieid, rating, ts) = line.split(',')  # split the line
        prefs.setdefault(user, {})
        prefs[user][movies[movieid]] = ratings[i]
        i+=1
    return prefs

At last, output matching the book. Checking the ratings of a randomly chosen user:
[Screenshot: the prefs dictionary entry for one user]
Now user-based recommendations work:

import recommendation
prefs=recommendation.loadMovieLens()
print(prefs['87'])
print(recommendation.getRecommendations(prefs,'87')[0:30])

[Screenshot: the list of recommendations, with some scores shown as 5.00001]
I was not sure why values like 5.00001 appear; most likely it is ordinary floating-point rounding in the weighted average (the division of accumulated sums need not come out as exactly 5.0) rather than a conversion problem in the data.

Item-based recommendations:

itemsim = recommendation.calculateSimilarItems(prefs,50)
print(recommendation.getRecommendedItems(prefs,itemsim,'87')[0:30])

But this errored again:
[Screenshot: TypeError raised by the status print in calculateSimilarItems]
The problem above is caused by the Python version; in Python 3 the status line should be written as:

if c % 100 == 0: print("%d / %d" % (c, len(itemPrefs)))

[Screenshot: the item-based recommendations for user 87]
Building the item similarity dataset really did take a long time, but the recommendation step finished almost instantly once that data was in place. Moreover, the time needed to get recommendations will not grow as the number of users increases.

When generating a list of recommendations for a large dataset, item-based filtering is clearly faster than user-based filtering, though it does carry the extra overhead of maintaining the item similarity table. There is also a difference in accuracy that depends on how "sparse" the dataset is. In the movie example, since every critic has rated nearly every movie, the dataset is dense (not sparse). Finding two users with similar bookmarks, on the other hand, is different: most bookmarks are saved by small niche groups, which makes for a sparse dataset. Item-based filtering usually outperforms user-based filtering on sparse datasets, while on dense datasets the two perform about equally well.
