Python in Practice, Part 2: Scraping Data with Python (I)

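Goal of this series:

The plan is to start from the GitHub project huatian-funny, use Python to scrape registered-user data from the Huatian (花田) site, run a small experiment, and then upload my modified version of huatian-funny.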

The huatian-funny project page carries the following description:

[Screenshots of the project description]


1. Preparation

Requirements:

requests>=2.7.0, pymongo>=3.2.2, matplotlib>=1.4.3, Pillow>=3.2.0

(1) Install requests (2.7.0 or later)

requests is an HTTP client library for Python.
It can be installed from source, or with pip or easy_install:

>pip install requests

pip reports that the installed version is 2.10.0.
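Note that 2.10.0 is newer than the required 2.7.0 even though it sorts lower as a plain string, so a version check should compare the numeric components. A minimal sketch, assuming simple dotted numeric versions; the helper names parse_version and meets_minimum are my own and are not part of pip or requests:

```python
def parse_version(text):
    """Turn a dotted version such as '2.10.0' into a tuple of ints: (2, 10, 0)."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# The version installed above, checked against the stated requirement.
print(meets_minimum("2.10.0", "2.7.0"))
# Plain string comparison gets this wrong, because "1" sorts before "7".
print("2.10.0" >= "2.7.0")
```

The first line prints True and the second prints False, which is why the tuple comparison is used.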

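(2) Install matplotlib

Installing matplotlib is covered in section 4 of the previous post in this series, "Python in Practice, Part 1: Preparation", so it is not repeated here.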

(3) Install Pillow

>pip install pillow


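(4) Install MongoDB

Download the installer from the MongoDB download page.
Once it has downloaded, run mongodb-win32-x86_64-2008plus-ssl-3.2.6-signed.msi and accept the defaults through to the end.
MongoDB is installed under C:\Program Files\MongoDB by default.
On Windows, MongoDB's default data directory is C:\data\db. Create this directory before starting the server.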

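· Start the mongod service: double-click mongod.exe, or start it from a command prompt with extra options:

mongod.exe --journal --rest

To use a data directory other than the default C:\data\db, pass the --dbpath option when starting the server, for example:

mongod.exe --dbpath yourpath
The startup options used here are:
--dbpath: the data directory;
--logpath: the log file to write to;
--journal: enable journaling;
--rest: allow clients to access the MongoDB server through the REST API (deprecated in 3.2 and removed in later MongoDB releases);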
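If you would rather start the server from a script than by double-clicking, the options above can be assembled into a command line in Python. This is only a sketch: build_mongod_command and start_mongod are hypothetical helpers of mine, the data directory is made up, and it assumes mongod.exe is on PATH.

```python
import subprocess


def build_mongod_command(dbpath=None, logpath=None, journal=False,
                         executable="mongod.exe"):
    """Build the argument list for starting mongod with the options above."""
    command = [executable]
    if dbpath:
        command += ["--dbpath", dbpath]
    if logpath:
        command += ["--logpath", logpath]
    if journal:
        command.append("--journal")
    return command


def start_mongod(**options):
    """Launch mongod in the background; assumes mongod.exe is on PATH."""
    return subprocess.Popen(build_mongod_command(**options))


# Hypothetical data directory, used only for illustration.
print(build_mongod_command(dbpath=r"D:\mongo\data", journal=True))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as C:\Program Files\MongoDB.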

After startup, the command window shows the server log. Its last line reports that the server is waiting for connections.

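· Connect

Double-click mongo.exe, or open another command prompt and run mongo.exe, to connect to the database.

Many commands are available in the shell (search online for more). For example, either of the following lists all databases:

show dbs
show databases

Back in the mongod.exe window, the log now reports one open connection.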

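(5) Install pymongo

The data collected by the crawler is stored in MongoDB; pymongo is the Python driver used to read and write it.
Install pymongo: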

>pip install pymongo

Upgrade pymongo:

>pip install --upgrade pymongo
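The scripts later in this post import mongo_collection from the project's extension module, which is not reproduced here. As a rough idea of what such a module could look like, here is a sketch; the database name huatian and the collection name users are my assumptions and are not taken from the project:

```python
def mongo_uri(host="localhost", port=27017):
    """Build a connection URI for a mongod listening on the given host and port."""
    return "mongodb://{}:{}/".format(host, port)


def get_collection(db_name="huatian", collection_name="users"):
    """Return a pymongo collection; the default names are placeholders."""
    # Imported here so the URI helper can be used without pymongo installed.
    from pymongo import MongoClient
    client = MongoClient(mongo_uri())
    return client[db_name][collection_name]


print(mongo_uri())
```

MongoDB creates the database and collection lazily on the first write, so nothing needs to be set up in advance beyond a running mongod.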


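(6) Install a MongoDB GUI tool: Robomongo

Robomongo is a GUI management tool for MongoDB.
Download it from the Robomongo site; I used robomongo-0.9.0-rc8-windows-x86_64-c113244.exe. Double-click the installer, choose an installation directory (mine is D:\softwares_diy\Robomongo 0.9.0-RC8\), and continue through the few remaining steps. At the end, choose to launch Robomongo immediately. In the connection dialog, click Create to add a new connection.

Before testing the connection, make sure the mongod service is running (that is, mongod.exe has been started). The last line in its window should say that it is waiting for connections on port 27017. Then go back to Robomongo and click Test.

Once the test reports success and you are connecting to a local MongoDB, click "Close" and then "Save".
In the Robomongo main window, choose File -> Connect to see the connection you just created.

Select the connection and click "Connect" to manage it.

If MongoDB is not running on the local machine, connect over SSH by entering the host IP, user name, and password.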
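A rough Python counterpart to the connectivity part of that test is to check whether the port accepts TCP connections at all. This sketch only tests reachability and does not speak the MongoDB protocol; can_connect is a name I made up:

```python
import socket


def can_connect(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        connection = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    connection.close()
    return True


print("mongod reachable:", can_connect())
```

If this prints False while mongod.exe is running, check that the server was started on the default port and that no firewall is blocking it.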


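2. Scraping the data

All the required components are now installed, and a MongoDB connection is open.

Download the huatian-funny project from GitHub and unzip it into a directory; mine is D:\pythonExperiments\huatian-funny-master.

My changes:

  • spider.py and mark.py
    My environment is Python 3.4, whereas the project's author used Python 2.x. Because the syntax and some library names differ between Python 2.x and 3.x, I made small changes to spider.py, mark.py, and other .py files so that they run correctly.

  • The author's spider.py finishes a single crawl quickly and then exits. After my change, spider.py runs automatically every 5 minutes, so data is collected continuously.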
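One example of the library renames: Python 2 exposed the form-encoding helper as urllib.urlencode, while Python 3 moved it to urllib.parse.urlencode. The sketch below encodes one search form the Python 3 way. build_search_form is a hypothetical helper of mine, and the field names simply mirror those used in spider.py:

```python
from urllib.parse import parse_qs, urlencode


def build_search_form(city, age):
    """Encode one search request body for a city code and a two-year age band."""
    fields = [
        ("province", "2"),
        ("city", str(city)),
        ("age", "{}-{}".format(age, age + 1)),
        ("condition", "1"),
    ]
    return urlencode(fields)


body = build_search_form(city=1, age=22)
print(body)
# parse_qs reverses the encoding, which is handy for checking the result.
print(parse_qs(body))
```

A list of pairs is used instead of a dict so that the field order is the same on every Python 3 version.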

The modified spider.py (the crawler):

# -*- coding=utf-8 -*-
import urllib.parse
from apscheduler.schedulers.blocking import BlockingScheduler
import os
from requests import Session
from extension import mongo_collection

session = Session()
LOGIN_HEADERS = {
    'Host': 'reg.163.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
              'image/webp,*/*;q=0.8',
    'Origin': 'http://love.163.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.110 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://love.163.com/',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Cookie': '_ntes_nnid=d53195032b58604628528cd6a374d63f,1460206631682; '
              '_ntes_nuid=d53195032b58604628528cd6a374d63f',
}
SEARCH_HEADERS = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'love.163.com',
    'Origin': 'http://love.163.com',
    'Pragma': 'no-cache',
    'Referer': 'http://love.163.com/search/user',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}


def login():
    """Log in to Huatian."""
    data = {
        # Replace these with your own account credentials.
        'username': '[email protected]',
        'password': 'wangyi887',
        'url': 'http://love.163.com/?checkUser=1&vendor=love.pLogin',
        'product': 'ht',
        'type': '1',
        'append': '1',
        'savelogin': '1',
    }
    response = session.post('https://reg.163.com/logins.jsp',
                            headers=LOGIN_HEADERS, data=urllib.parse.urlencode(data))
    assert response.ok


def search():
    """Search by Shanghai district and age range."""
    for city in range(1, 20):
        for age in range(22, 27, 2):
            data = {
                'province': '2',
                'city': str(city),
                'age': '{}-{}'.format(age, age + 1),
                'condition': '1',
            }
            response = session.post('http://love.163.com/search/user/list',
                                    headers=SEARCH_HEADERS, data=urllib.parse.urlencode(data))
            if not response.ok:
                # format() must be applied to the string, not to print()'s return value.
                print('city:{} age:{} failed'.format(city, age))
                continue

            users = response.json()['list']
            for user in users:
                # $set merges the crawled fields, so scores saved by mark.py survive re-crawls.
                mongo_collection.update_one({'id': user['id']}, {'$set': user}, upsert=True)

def loginAndSearch():
    login()
    search()

if __name__ == '__main__':

    # Run every 5 minutes; change the interval as needed.
    # With an interval trigger, the first run happens 5 minutes after start-up.
    scheduler = BlockingScheduler()
    scheduler.add_job(loginAndSearch,'interval', minutes=5)
    print ('Press Ctrl+{0} to exit'.format('Pause/Break' if os.name == 'nt' else 'C'))
    try:
        scheduler.start()
    except (KeyboardInterrupt,SystemExit):
        scheduler.shutdown()
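Because the crawl repeats every 5 minutes, the same users come back again and again; the upsert keyed on id is what keeps the collection free of duplicates. The idea can be illustrated without a database. This sketch uses a plain dict as a stand-in for the collection and made-up records, so it only shows de-duplication by id, not MongoDB's exact field-level behaviour:

```python
def upsert_by_id(store, user):
    """Insert the user if its id is new, otherwise overwrite the stored record."""
    is_new = user["id"] not in store
    store[user["id"]] = user
    return is_new


store = {}
# Two crawl rounds that overlap on user 1 (made-up records).
first_round = [{"id": 1, "age": 24}, {"id": 2, "age": 26}]
second_round = [{"id": 1, "age": 25}, {"id": 3, "age": 22}]

for user in first_round + second_round:
    upsert_by_id(store, user)

print(len(store))       # three distinct users, not four
print(store[1]["age"])  # the value from the later crawl
```

Four records went in, but only three users are stored, and user 1 carries the age from the second round.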

The modified mark.py (the subjective scoring program):

# -*- coding=utf-8 -*-
"""Scoring program."""

import io
from urllib import request
from tkinter import messagebox, Tk, Label, Button, Radiobutton, IntVar
from PIL import Image, ImageTk
from extension import mongo_collection, BUY_HOUSE, BUY_CAR,\
    EDUCATION, INDUSTRY, SALARY, POSITION

master = None
tk_image = None

offset = 0
user, photo, url, buy_house, buy_car, age, height, salary, education, company, \
industry, school, position, satisfy, appearance = [None for i in range(15)]


def get_user(offset=0):
    """Read one user from MongoDB, skipping the first `offset` users."""
    global user
    user = mongo_collection.find_one({}, skip=offset, limit=1, sort=[('url', -1)])


def init_master():
    """Initialize the main window."""
    global master
    master = Tk()
    master.title(u'花田')
    master.geometry(u'630x530')
    master.resizable(width=False, height=False)


def place_image(image_url):
    """Download the user's avatar and keep it as a Tk image."""
    global tk_image
    image_bytes = request.urlopen(image_url).read()
    data_stream = io.BytesIO(image_bytes)
    pil_image = Image.open(data_stream)
    tk_image = ImageTk.PhotoImage(pil_image)


def set_appearance():
    """Save the appearance score for the current user."""
    mongo_collection.update_one({'url': user['url']},
                                {'$set': {'appearance': appearance.get()}})


def set_satisfy():
    """Save the satisfied / not satisfied choice for the current user."""
    mongo_collection.update_one({'url': user['url']},
                                {'$set': {'satisfy': satisfy.get()}})


def update():
    """Refresh the page with the current user's data."""
    global user, offset, photo, url, buy_house, buy_car, age, height, salary, \
        education, company, industry, school, position, satisfy, appearance
    image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
    place_image(image_url)

    print (offset)

    photo['image'] = tk_image
    url['text'] = user['url']
    buy_house['text'] = BUY_HOUSE.get(user['house']) or user['house']
    buy_car['text'] = BUY_CAR.get(user['car']) or user['car']
    age['text'] = user['age']
    height['text'] = user['height']
    salary['text'] = SALARY.get(user['salary']) or user['salary']
    education['text'] = EDUCATION.get(user['education']) or user['education']
    company['text'] = user['company'] if user['company'] else u'--'
    industry['text'] = INDUSTRY.get(user['industry']) or user['industry']
    school['text'] = user['school'] if user['school'] else u'--'
    position['text'] = POSITION.get(user['position']) or user['position']

    satisfy.set(int(user.get(u'satisfy', -1)))
    appearance.set(int(user.get(u'appearance', -1)))


def init():
    """Build the page for the first user."""
    global user, offset, photo, url, buy_house, buy_car, age, height, salary, \
        education, company, industry, school, position, satisfy, appearance
    get_user(offset)
    image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
    place_image(image_url)
    photo = Label(master, image=tk_image)
    photo.place(anchor=u'nw', x=10, y=40)
    # A Tk font tuple is (family, size, style).
    url = Label(master, font=('Helvetica', 20, 'bold'), text=user['url'])
    url.place(anchor=u'nw', x=10, y=5)
    buy_house = Label(master, text=BUY_HOUSE.get(user['house']) or user['house'])
    buy_house.place(anchor=u'nw', x=490, y=50)
    buy_car = Label(master, text=BUY_CAR.get(user['car']) or user['car'])
    buy_car.place(anchor=u'nw', x=490, y=75)
    age = Label(master, text=user['age'])
    age.place(anchor=u'nw', x=490, y=100)
    height = Label(master, text=user['height'])
    height.place(anchor=u'nw', x=490, y=125)
    salary = Label(master, text=SALARY.get(user['salary']) or user['salary'])
    salary.place(anchor=u'nw', x=490, y=150)
    education = Label(master, text=EDUCATION.get(user['education']) or user['education'])
    education.place(anchor=u'nw', x=490, y=175)
    company = Label(master, text=user['company'] if user['company'] else u'--')
    company.place(anchor=u'nw', x=490, y=200)
    industry = Label(master, text=INDUSTRY.get(user['industry']) or user['industry'])
    industry.place(anchor=u'nw', x=490, y=225)
    school = Label(master, text=user['school'] if user['school'] else u'--')
    school.place(anchor=u'nw', x=490, y=250)
    position = Label(master, text=POSITION.get(user['position']) or user['position'])
    position.place(anchor=u'nw', x=490, y=275)

    satisfy = IntVar()
    satisfy.set(int(user.get(u'satisfy', -1)))
    satisfied = Radiobutton(master, text=u"滿意", variable=satisfy,
                            value=1, command=set_satisfy)
    not_satisfied = Radiobutton(master, text=u"不滿意", variable=satisfy,
                                value=0, command=set_satisfy)
    satisfied.place(anchor=u'nw', x=450, y=10)
    not_satisfied.place(anchor=u'nw', x=510, y=10)

    appearance = IntVar()
    appearance.set(int(user.get(u'appearance', -1)))
    for i in range(1, 11):
        score_i = Radiobutton(master, text=str(i), variable=appearance,
                              value=i, command=set_appearance)
        score_i.place(anchor=u'nw', x=i * 40 - 30, y=460)


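def handle_previous():
    """Show the previous user."""
    global offset
    if offset <= 0:
        # Message: "already at the first one"
        messagebox.showwarning(u'error', u'已經是第一個')
        return

    offset -= 1
    get_user(offset)
    update()


def handle_next():
    """Show the next user."""
    global offset

    offset += 1
    get_user(offset)
    if not user:
        # Message: "already at the last one"; step back to the last valid user.
        messagebox.showwarning(u'error', u'已經是最後一個')
        offset -= 1
        get_user(offset)
        return
    update()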


def add_assembly():
    """Add the widgets."""
    init()

    buy_house_static = Label(master, font=("15"), text=u'購房: ')
    buy_house_static.place(anchor=u'nw', x=440, y=50)
    buy_car_static = Label(master, font=("15"), text=u'購車: ')
    buy_car_static.place(anchor=u'nw', x=440, y=75)
    age_static = Label(master, font=("15"), text=u'年齡: ')
    age_static.place(anchor=u'nw', x=440, y=100)
    height_static = Label(master, font=("15"), text=u'身高: ')
    height_static.place(anchor=u'nw', x=440, y=125)
    salary_static = Label(master, font=("15"), text=u'工資: ')
    salary_static.place(anchor=u'nw', x=440, y=150)
    education_static = Label(master, font=("15"), text=u'學歷: ')
    education_static.place(anchor=u'nw', x=440, y=175)
    company_static = Label(master, font=("15"), text=u'公司: ')
    company_static.place(anchor=u'nw', x=440, y=200)
    industry_static = Label(master, font=("15"), text=u'行業: ')
    industry_static.place(anchor=u'nw', x=440, y=225)
    school_static = Label(master, font=("15"), text=u'學校: ')
    school_static.place(anchor=u'nw', x=440, y=250)
    position_static = Label(master, font=("15"), text=u'職位: ')
    position_static.place(anchor=u'nw', x=440, y=275)
    previous = Button(master, text=u'上一個', command=handle_previous)
    previous.place(anchor=u'nw', x=10, y=490)
    # Named next_button to avoid shadowing the built-in next().
    next_button = Button(master, text=u'下一個', command=handle_next)
    next_button.place(anchor=u'nw', x=520, y=490)


if __name__ == '__main__':
    init_master()
    add_assembly()
    master.mainloop()
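mark.py repeatedly uses the pattern MAPPING.get(code) or code to turn stored codes into display text, falling back to the raw value when a code is unknown. The mappings themselves live in the extension module, which is not shown in this post, so the table in this sketch is invented purely for illustration, and display_text is a hypothetical helper:

```python
# Invented example table; the real BUY_HOUSE mapping lives in the extension module.
EXAMPLE_HOUSE_LABELS = {0: "not specified", 1: "owns a home", 2: "renting"}


def display_text(mapping, code, placeholder="--"):
    """Return the label for code, the raw code if unmapped, or a placeholder."""
    label = mapping.get(code) or code
    # Show a placeholder for missing values; an unmapped 0 is kept as it is.
    if label is None or label == "":
        return placeholder
    return label


print(display_text(EXAMPLE_HOUSE_LABELS, 1))     # mapped code
print(display_text(EXAMPLE_HOUSE_LABELS, 7))     # unknown code, shown raw
print(display_text(EXAMPLE_HOUSE_LABELS, None))  # missing value
```

Wrapping the pattern in one function would also let init() and update() share the same display logic instead of repeating it for every field.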

I have not yet modified or debugged train.py, so the decision-tree training part has not been tried out here.

