A Vertical Search Crawler with Django + Python + BeautifulSoup

This project uses Python and BeautifulSoup to crawl pages and extract specific data, and uses Django to build a management platform that coordinates the crawling.

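I like the Django admin backend a lot, so this time I use it to manage the crawled links. That lets the crawler handle later requirements, such as crawling in scheduled time slots or periodically re-crawling addresses that have already been fetched. The database is sqlite3, which ships with Python, so setup is easy.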

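I happen to be building a movie recommendation system these days and need some movie data, so the example in this post crawls specific data from Douban Movies.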

Step 1: Define the Django model

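The design imitates the crawling approach of Nutch, in simplified form. At the start of each crawl job, the links that have not been saved yet (is_save = False) are read from the database and put on the crawl list. You can also filter the links according to your own needs.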

Python code:

from django.db import models

class Crawl_URL(models.Model):
    url = models.URLField('crawl URL', max_length=100, unique=True)
    weight = models.SmallIntegerField('crawl depth', default=0)  # crawl depth, starting from 1
    is_save = models.BooleanField('saved', default=False)  # set to True once the page has been crawled
    date = models.DateTimeField('saved at', auto_now_add=True, blank=True, null=True)

    def __unicode__(self):
        return self.url

Then create the corresponding database table.
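The "pick up every link that has not been saved yet" step can also be tried out without Django. The following is a minimal sketch using the sqlite3 module from the standard library. It is written for Python 3, whereas the code in this post targets Python 2, and the table name and the helper functions `add_url`, `pending_urls`, and `mark_saved` are hypothetical names chosen for illustration. The columns mirror the fields of the Crawl_URL model.

```python
import sqlite3


def create_table(conn):
    # Hypothetical table that mirrors the Crawl_URL fields (without the date column).
    conn.execute(
        "CREATE TABLE crawl_url ("
        "id INTEGER PRIMARY KEY, "
        "url TEXT UNIQUE NOT NULL, "
        "weight INTEGER NOT NULL DEFAULT 0, "
        "is_save INTEGER NOT NULL DEFAULT 0)"
    )


def add_url(conn, url, weight=0):
    """Insert a URL; return False if the UNIQUE constraint rejects a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO crawl_url (url, weight) VALUES (?, ?)", (url, weight)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def pending_urls(conn):
    """Return (url, weight) pairs that have not been crawled yet."""
    rows = conn.execute(
        "SELECT url, weight FROM crawl_url WHERE is_save = 0 ORDER BY id"
    )
    return rows.fetchall()


def mark_saved(conn, url):
    """Flag a URL as crawled so later passes skip it."""
    with conn:
        conn.execute("UPDATE crawl_url SET is_save = 1 WHERE url = ?", (url,))


conn = sqlite3.connect(":memory:")
create_table(conn)
add_url(conn, "http://movie.douban.com/chart")
add_url(conn, "http://movie.douban.com/subject/1/", weight=1)
mark_saved(conn, "http://movie.douban.com/chart")
print(pending_urls(conn))
```

Relying on the unique constraint to reject duplicates keeps the insert logic short, at the cost of one failed write per repeated link.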

An admin backend is also needed:

from django.contrib import admin
from movie.models import Crawl_URL

class Crawl_URLAdmin(admin.ModelAdmin):
    list_display = ('url', 'weight', 'is_save', 'date',)
    ordering = ('-id',)
    list_filter = ('is_save', 'weight', 'date',)
    fields = ('url', 'weight', 'is_save',)

admin.site.register(Crawl_URL, Crawl_URLAdmin)

Step 2: Write the crawler

The crawler is single-threaded and pauses after every fetch, because Douban blocks crawlers that request pages too aggressively.

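The crawl is controlled by depth. Each pass first builds the list of links, then fetches them and parses out more links, and finally sets is_save=True on the links that were fetched and stores the new links in the database. After each depth level is finished, importing the links into the database takes a fairly long time, because every link has to be checked against the ones already stored.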
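The depth-by-depth control flow described above can be sketched independently of Django and of the network. This is a simplified illustration written for Python 3, not the crawler used in this post: `crawl_by_depth` and the `fetch_links` callable are hypothetical, the frontier lives in memory rather than in the database, and the `site` dictionary stands in for real pages.

```python
def crawl_by_depth(seeds, fetch_links, max_depth, pause=lambda: None):
    """Crawl level by level and record the depth at which each URL was found.

    seeds: iterable of start URLs (depth 0).
    fetch_links: hypothetical callable, url -> iterable of URLs found on that page.
    pause: called after every fetch; pass lambda: time.sleep(2.5) for a real site.
    """
    depth_of = {url: 0 for url in seeds}
    frontier = list(depth_of)
    for depth in range(1, max_depth + 1):
        discovered = []
        for url in frontier:
            for link in fetch_links(url):
                if link not in depth_of:  # skip links that are already known
                    depth_of[link] = depth
                    discovered.append(link)
            pause()
        if not discovered:
            break  # nothing new at this level, so deeper passes would be empty
        frontier = discovered
    return depth_of


# A tiny fake site: each key is a page, each value the links found on it.
site = {
    "chart": ["subject/1", "subject/2"],
    "subject/1": ["subject/2", "subject/3"],
    "subject/2": [],
    "subject/3": ["chart"],
}
result = crawl_by_depth(["chart"], lambda url: site.get(url, []), max_depth=2)
print(result)
```

Keeping the pause as a parameter makes the loop easy to test without waiting, while a real run can still sleep between requests.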

The crawler only parses data from addresses that match the regular expression http://movie.douban.com/subject/(\d+)/ and simply ignores links that are outside the movie section.
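Both rules can be checked in isolation. The sketch below is written for Python 3 with only the standard library. The function names `movie_id` and `normalize_link` are hypothetical, the subject number in the usage lines is an arbitrary example, and the 60-character limit and the quote check are copied from the crawler code in this step. Unlike the pattern quoted above, this version escapes the dots in the host name, which is a small tightening on my part.

```python
import re
from urllib.parse import urldefrag, urljoin

# Movie detail pages look like http://movie.douban.com/subject/<digits>/
MOVIE_PAGE = re.compile(r'^http://movie\.douban\.com/subject/(\d+)/$')


def movie_id(url):
    """Return the numeric subject ID as a string, or None for non-movie pages."""
    match = MOVIE_PAGE.match(url)
    return match.group(1) if match else None


def normalize_link(page_url, href, prefix="http://movie.douban.com", max_len=60):
    """Return an absolute, fragment-free URL, or None if the link should be skipped."""
    url = urljoin(page_url, href)  # resolve relative links against the page
    if "'" in url or len(url) > max_len:
        return None
    url = urldefrag(url).url  # drop the #fragment part
    if not url.startswith(prefix):
        return None  # outside the movie section
    return url


print(movie_id("http://movie.douban.com/subject/1292052/"))
print(movie_id("http://movie.douban.com/chart"))
print(normalize_link("http://movie.douban.com/chart", "/subject/1292052/#comments"))
print(normalize_link("http://movie.douban.com/chart", "http://www.douban.com/"))
```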

For the first crawl, add a seed link in the admin backend, for example http://movie.douban.com/chart. This is a chart page, so the movies on it are fairly popular.

Python code:

# coding=UTF-8
import re
import urllib2
from time import sleep
from urlparse import urljoin

from BeautifulSoup import *
from django.contrib.auth.models import User
from movie.models import *

image_path = 'C:/Users/soul/djcodetest/picture/'
user = User.objects.get(id=1)

def crawl(depth=10):
    for i in range(1, depth):
        print 'Start crawling, pass %d....' % i
        pages = Crawl_URL.objects.filter(is_save=False)
        newurls = {}
        for crawl_page in pages:
            page = crawl_page.url
            try:
                c = urllib2.urlopen(page)
            except Exception:
                continue
            try:
                # Parse the metadata and the URLs
                soup = BeautifulSoup(c.read())
                # Parse movie pages
                if re.search(r'^http://movie.douban.com/subject/(\d+)/$', page):
                    read_html(soup)
                # Extract the valid links and put them into newurls
                links = soup('a')
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1: continue
                        if len(url) > 60: continue
                        url = url.split('#')[0]  # remove the location portion
                        if re.search(r'^http://movie.douban.com', url):
                            newurls[url] = crawl_page.weight + 1  # valid link, store it in the dict
                            try:
                                print 'add url :'
                            except Exception:
                                pass
            except Exception as e:
                try:
                    print "Could not parse : %s" % (e.args,)
                except Exception:
                    pass
            # newurls are stored in the database with is_save=False, weight=i
            crawl_page.is_save = True
            crawl_page.save()
            # Sleep for 2.5 seconds
            sleep(2.5)
        save_url(newurls)

# Save the URLs to the database
def save_url(newurls):
    for (url, weight) in newurls.items():
        url = Crawl_URL(url=url, weight=weight)
        try:
            url.save()
        except Exception:
            # unique=True on the url field rejects links that are already stored
            try:
                print 'duplicate url:'
            except Exception:
                pass
    return True

Step 3: Parse the pages with BeautifulSoup

The parser extracts the movie title, image, plot summary, leading actor, tags, and region. For details on using BeautifulSoup, see the BeautifulSoup documentation.

# Extract the data
def read_html(soup):
    # Title: drop the trailing 5 characters of the page <title>
    html_title = soup.html.head.title.string
    title = html_title[:-5]
    # Plot summary: try three page layouts in turn
    try:
        intro = soup.find('span', attrs={'class': 'all hidden'}).text
    except Exception:
        try:
            node = soup.find('div', attrs={'class': 'blank20'}).previousSibling
            intro = node.contents[0] + node.contents[2]
        except Exception:
            try:
                contents = soup.find('div', attrs={'class': 'blank20'}).previousSibling.previousSibling.text
                intro = contents[:-22]
            except Exception:
                intro = u'N/A'

    # Download the image
    html_image = soup('a', href=re.compile('douban.com/lpic'))[0]['href']
    data = urllib2.urlopen(html_image).read()
    image = '201003/' + html_image[html_image.rfind('/') + 1:]
    with open(image_path + image, 'wb') as f:
        f.write(data)

    # Region
    try:
        soup_obmo = soup.find('div', attrs={'class': 'obmo'}).findAll('span')
        html_area = soup_obmo[0].nextSibling.split('/')
        area = html_area[0].lstrip()
    except Exception:
        area = ''

    #time = soup_obmo[1].nextSibling.split(' ')[1]
    #time = time.strptime(html_time,'%Y-%m-%d')

    # Create the movie object
    new_movie = Movie(title=title, intro=intro, area=area, version=u'N/A', upload_user=user, image=image)
    new_movie.save()
    # Leading actor: reuse the existing Actor row or create a new one
    try:
        actors = soup.find('div', attrs={'id': 'info'}).findAll('span')[5].nextSibling.nextSibling.string.split(' ')[0]
        actors_list = Actor.objects.filter(name=actors)
        if len(actors_list) == 1:
            actor = actors_list[0]
            new_movie.actors.add(actor)
        else:
            actor = Actor(name=actors)
            actor.save()
            new_movie.actors.add(actor)
    except Exception:
        pass

    # Tags: skip tags longer than 4 characters, reuse existing Tag rows
    tags = soup.find('div', attrs={'class': 'blank20'}).findAll('a')
    for tag_html in tags:
        tag_str = tag_html.string
        if len(tag_str) > 4:
            continue
        tag_list = Tag.objects.filter(name=tag_str)
        if len(tag_list) == 1:
            tag = tag_list[0]
            new_movie.tags.add(tag)
        else:
            tag = Tag(name=tag_str)
            tag.save()
            new_movie.tags.add(tag)
    #try:

    #except Exception.args:
    #   print "Could not download : %s" % args
    print 'download success'

Douban's movie pages do not all share the same layout, so the extracted results may sometimes be slightly off.
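One way to cope with pages that vary in layout is to try several extraction strategies in order and fall back to a default, which is what the nested try/except blocks for the plot summary do. The sketch below shows the same idea in a flatter form. It is written for Python 3, the helper name `first_successful` is hypothetical, and the `page` dictionary is made-up sample data standing in for a parsed page. The AttributeError case matters in practice because BeautifulSoup's find returns None when nothing matches.

```python
def first_successful(extractors, default):
    """Run each zero-argument callable in order; return the first usable result."""
    for extract in extractors:
        try:
            value = extract()
        except (AttributeError, IndexError, KeyError, TypeError):
            continue  # this layout does not match the page, try the next one
        if value:
            return value
    return default


# Made-up sample data: the first layout is missing, the second one is present.
page = {"summary": None, "short": "  plot text from the fallback node  "}

intro = first_successful(
    [
        lambda: page["summary"].strip(),  # raises AttributeError on None
        lambda: page["short"].strip(),
    ],
    default="N/A",
)
print(intro)
```

Listing the exception types explicitly keeps unrelated errors, such as a typo in an extractor, from being silently swallowed.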
