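Purpose of this hands-on series:
The plan is to start from the GitHub project huatian-funny, use Python to scrape data on registered users of the Huatian site (花田網, love.163.com) as a small experiment, and then upload my modified version of the huatian-funny project.
The huatian-funny project page gives the following instructions:
1. Preparation
Requirements:
requests>=2.7.0, pymongo>=3.2.2, matplotlib>=1.4.3, Pillow>=3.2.0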
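(1) Install requests (>= 2.7.0)
requests is an HTTP client library for Python.
It can be installed from source, or with pip or easy_install:
>pip install requests
The output shows that version 2.10.0 was installed.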
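To confirm that an installed version such as 2.10.0 satisfies a minimum such as 2.7.0, the version numbers have to be compared as numbers, not as text. The following is a minimal sketch under the assumption that versions are purely numeric and dot-separated; the helper names parse_version and meets_minimum are hypothetical and not part of huatian-funny.

```python
def parse_version(text):
    """Turn '2.10.0' into (2, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split('.'))


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Comparing the raw strings gives the wrong answer, because '1' sorts before '7'.
print('2.10.0' >= '2.7.0')               # False
print(meets_minimum('2.10.0', '2.7.0'))  # True
```

Version strings with suffixes such as "rc1" would need a real version parser; this sketch does not handle them.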
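(2) Install matplotlib
See part 4, "Installing matplotlib", of the earlier post "python實踐之準備 (一)" (Python Practice: Preparation, Part 1); it is not repeated here.
(3) Install Pillow
>pip install pillow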
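(4) Install MongoDB
It can be downloaded from the MongoDB download page.
When the download finishes, run mongodb-win32-x86_64-2008plus-ssl-3.2.6-signed.msi and accept the defaults on every screen until the installer completes.
By default MongoDB is installed under C:\Program Files\MongoDB.
On Windows the default MongoDB data directory is C:\data\db; create this directory before starting the server.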
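· Start the mongod service: double-click mongod.exe, or start it from a command prompt with extra options:
mongod.exe --journal --rest
To use a data directory other than the default C:\data\db, pass the --dbpath option when starting the server, for example:
mongod.exe --dbpath yourpath
The startup options are:
--dbpath: the database directory;
--logpath: the file the server log is written to;
--journal: enable journaling;
--rest: allow clients to access the MongoDB server through the REST API.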
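The startup options above can also be assembled from Python, which makes it easy to create the data directory first. This is only a sketch: the function name build_mongod_command is hypothetical, and it assumes mongod.exe is on the PATH.

```python
import os


def build_mongod_command(dbpath, logpath=None, journal=True, rest=False):
    """Create dbpath if it is missing and return the mongod argument list."""
    # The data directory has to exist before mongod starts.
    os.makedirs(dbpath, exist_ok=True)
    command = ['mongod.exe', '--dbpath', dbpath]
    if logpath:
        command += ['--logpath', logpath]
    if journal:
        command.append('--journal')
    if rest:
        command.append('--rest')
    return command
```

The returned list could then be passed to subprocess.Popen to launch the server; building the command as a list avoids quoting problems with paths that contain spaces, such as C:\Program Files\MongoDB.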
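Once the server is up, the command window shows its startup log; the last line says it is waiting for connections.
· Connect
Double-click mongo.exe, or open another command prompt and type mongo.exe, to connect to the database.
Some commands you can run in the shell (search online for more):
show dbs
show databases
# both commands list all databases
Back in the mongod.exe command window, the number of open connections has changed to 1.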
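Whether mongod is accepting connections can also be checked from a script before the crawler is started. The sketch below uses only the standard library and assumes the server listens on localhost port 27017; is_port_open is a hypothetical helper name. It only shows that something is listening on the port, not that the listener is MongoDB.

```python
import socket


def is_port_open(host='127.0.0.1', port=27017, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if is_port_open():
    print('something is accepting connections on port 27017')
else:
    print('nothing is listening on port 27017; is mongod.exe running?')
```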
(5) Install pymongo
The scraped data is stored in MongoDB; pymongo is the Python driver the crawler uses to write it.
Install pymongo:
>pip install pymongo
Upgrade pymongo:
>pip install --upgrade pymongo
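(6) Install a MongoDB GUI tool: Robomongo
Robomongo is a GUI management tool for MongoDB.
It can be downloaded from the Robomongo site. I downloaded robomongo-0.9.0-rc8-windows-x86_64-c113244.exe. Double-click it, choose an installation directory (mine is D:\softwares_diy\Robomongo 0.9.0-RC8\) and continue; there are only a few steps. At the end choose to launch Robomongo immediately, then click Create to set up a new connection.
Make sure the mongod service is running (mongod.exe has been started) and that the last line in its window says it is waiting for connections on port 27017, then go back to Robomongo and click Test.
The connection succeeds. If you are connecting to a local MongoDB, just click "Close" and then "Save".
In the Robomongo main window, click File -> Connect; the connection you just created is listed.
Select it and click "Connect" to manage that connection.
To connect to a MongoDB instance that is not on the local machine, connect over SSH and enter the IP address, user name and password.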
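2. Scraping the data
All the required components are now installed and the MongoDB connection is open.
Download the huatian-funny project from GitHub, unzip it and put it in a directory; mine is D:\pythonExperiments\huatian-funny-master.
My changes:
spider.py and mark.py
My environment is Python 3.4, while the project's author used Python 2.x. Some syntax and library names differ between Python 2.x and Python 3.x, so I made small changes to spider.py, mark.py and the other .py files so that they run correctly. The author's spider.py finished a single crawl quickly and then stopped; the modified spider.py runs automatically every 5 minutes, so data is collected continuously.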
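Two typical Python 2 to Python 3 differences of the kind described above are the location of urlencode and the fact that print is a function. The snippet below is a small illustration with made-up form values, not an excerpt from the project.

```python
from urllib.parse import urlencode  # Python 2 spelled this urllib.urlencode

# Example form fields; the values are placeholders for illustration.
data = {'province': '2', 'city': '1', 'age': '22-23'}
body = urlencode(data)
print(body)

# print is a function in Python 3, so .format() must be applied to the string
# inside the parentheses, not to the return value of print() (which is None).
city, age = 1, 22
message = 'city:{} age:{} failed'.format(city, age)
print(message)
```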
Modified spider.py, the scraping program:
# -*- coding: utf-8 -*-
import os
import urllib.parse

from apscheduler.schedulers.blocking import BlockingScheduler
from requests import Session

from extension import mongo_collection

session = Session()

LOGIN_HEADERS = {
    'Host': 'reg.163.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
              'image/webp,*/*;q=0.8',
    'Origin': 'http://love.163.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.110 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'http://love.163.com/',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Cookie': '_ntes_nnid=d53195032b58604628528cd6a374d63f,1460206631682; '
              '_ntes_nuid=d53195032b58604628528cd6a374d63f',
}

SEARCH_HEADERS = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'love.163.com',
    'Origin': 'http://love.163.com',
    'Pragma': 'no-cache',
    'Referer': 'http://love.163.com/search/user',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/49.0.2623.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}


def login():
    """Log in to Huatian."""
    data = {
        'username': '[email protected]',
        'password': 'wangyi887',
        'url': 'http://love.163.com/?checkUser=1&vendor=love.pLogin',
        'product': 'ht',
        'type': '1',
        'append': '1',
        'savelogin': '1',
    }
    response = session.post('https://reg.163.com/logins.jsp',
                            headers=LOGIN_HEADERS,
                            data=urllib.parse.urlencode(data))
    assert response.ok


def search():
    """Search by Shanghai district and by age range."""
    for city in range(1, 20):
        for age in range(22, 27, 2):
            data = {
                'province': '2',
                'city': str(city),
                'age': '{}-{}'.format(age, age + 1),
                'condition': '1',
            }
            response = session.post('http://love.163.com/search/user/list',
                                    headers=SEARCH_HEADERS,
                                    data=urllib.parse.urlencode(data))
            if not response.ok:
                # Format the string inside print(); print() itself returns None.
                print('city:{} age:{} failed'.format(city, age))
                continue
            users = response.json()['list']
            for user in users:
                # Upsert keyed on the user id, so repeated runs do not create duplicates.
                mongo_collection.update({'id': user['id']}, user, upsert=True)


def loginAndSearch():
    login()
    search()


if __name__ == '__main__':
    # Run every 5 minutes; change the interval to suit your needs.
    scheduler = BlockingScheduler()
    scheduler.add_job(loginAndSearch, 'interval', minutes=5)
    print('Press Ctrl+{0} to exit'.format('Pause/Break' if os.name == 'nt' else 'C'))
    try:
        scheduler.start()
    except (KeyboardInterrupt, SystemExit):
        scheduler.shutdown()
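Because the crawl repeats every 5 minutes, each user is written with an upsert keyed on the id field, so a user seen again is overwritten and not duplicated. Collection.update() is deprecated in PyMongo 3 and removed in PyMongo 4; the sketch below shows the same idea with replace_one(), which exists in PyMongo 3.0 and later. The function name save_users is hypothetical, and the collection argument is assumed to be a pymongo Collection such as the mongo_collection imported from extension.py.

```python
def save_users(collection, users):
    """Upsert every user document, keyed on its 'id' field.

    `collection` is expected to behave like a pymongo Collection.
    Returns the number of documents processed.
    """
    for user in users:
        # Replace the stored document with the same id, or insert it if absent.
        collection.replace_one({'id': user['id']}, user, upsert=True)
    return len(users)
```

Inside search(), the inner loop over users could then be written as save_users(mongo_collection, users).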
Modified mark.py, the subjective scoring program:
# -*- coding: utf-8 -*-
"""Scoring program."""
import io
from urllib import request
from tkinter import messagebox, Tk, font, Label, Button, Radiobutton, IntVar
#import tkinter.font as Font
#from tkinter import *

from PIL import Image, ImageTk

from extension import mongo_collection, BUY_HOUSE, BUY_CAR,\
    EDUCATION, INDUSTRY, SALARY, POSITION

master = None
tk_image = None
offset = 0
# Widgets and state shared between the functions below.
user, photo, url, buy_house, buy_car, age, height, salary, education, company, \
    industry, school, position, satisfy, appearance = [None for i in range(15)]


def get_user(offset=0):
    """Read one user record from MongoDB."""
    global user
    user = mongo_collection.find_one({}, skip=offset, limit=1, sort=[('url', -1)])


def init_master():
    """Initialise the main window."""
    global master
    master = Tk()
    master.title(u'花田')
    master.geometry(u'630x530')
    master.resizable(width=False, height=False)


def place_image(image_url):
    """Download the user's avatar and keep it as a Tk image."""
    global tk_image
    image_bytes = request.urlopen(image_url).read()
    data_stream = io.BytesIO(image_bytes)
    pil_image = Image.open(data_stream)
    tk_image = ImageTk.PhotoImage(pil_image)

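
def set_appearance():
    """Store the avatar score."""
    # Collection.update() is deprecated in PyMongo 3; update_one() takes the
    # same two arguments here.
    mongo_collection.update({'url': user['url']},
                            {'$set': {'appearance': appearance.get()}})


def set_satisfy():
    """Store whether the user is rated as satisfactory."""
    mongo_collection.update({'url': user['url']},
                            {'$set': {'satisfy': satisfy.get()}})
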

def update():
    """Refresh the page with the current user."""
    global user, offset, photo, url, buy_house, buy_car, age, height, salary, \
        education, company, industry, school, position, satisfy, appearance
    image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
    place_image(image_url)
    print(offset)
    photo['image'] = tk_image
    url['text'] = user['url']
    buy_house['text'] = BUY_HOUSE.get(user['house']) or user['house']
    buy_car['text'] = BUY_CAR.get(user['car']) or user['car']
    age['text'] = user['age']
    height['text'] = user['height']
    salary['text'] = SALARY.get(user['salary']) or user['salary']
    education['text'] = EDUCATION.get(user['education']) or user['education']
    company['text'] = user['company'] if user['company'] else u'--'
    industry['text'] = INDUSTRY.get(user['industry']) or user['industry']
    school['text'] = user['school'] if user['school'] else u'--'
    # Update the label text; assigning to `position` itself would replace the widget.
    position['text'] = POSITION.get(user['position']) or user['position']
    satisfy.set(int(user.get(u'satisfy', -1)))
    appearance.set(int(user.get(u'appearance', -1)))


def init():
    """Initialise the page."""
    global user, offset, photo, url, buy_house, buy_car, age, height, salary, \
        education, company, industry, school, position, satisfy, appearance
    get_user(offset)
    image_url = u'{}&quality=85&thumbnail=410y410'.format(user['avatar'])
    place_image(image_url)
    photo = Label(master, image=tk_image)
    photo.place(anchor=u'nw', x=10, y=40)
    #url = Label(master, text=user['url'],font=Font(size=20, weight='bold'))
    url = Label(master, font=("20"), text=user['url'])
    url.place(anchor=u'nw', x=10, y=5)
    buy_house = Label(master, text=BUY_HOUSE.get(user['house']) or user['house'])
    buy_house.place(anchor=u'nw', x=490, y=50)
    buy_car = Label(master, text=BUY_CAR.get(user['car']) or user['car'])
    buy_car.place(anchor=u'nw', x=490, y=75)
    age = Label(master, text=user['age'])
    age.place(anchor=u'nw', x=490, y=100)
    height = Label(master, text=user['height'])
    height.place(anchor=u'nw', x=490, y=125)
    salary = Label(master, text=SALARY.get(user['salary']) or user['salary'])
    salary.place(anchor=u'nw', x=490, y=150)
    education = Label(master, text=EDUCATION.get(user['education']) or user['education'])
    education.place(anchor=u'nw', x=490, y=175)
    company = Label(master, text=user['company'] if user['company'] else u'--')
    company.place(anchor=u'nw', x=490, y=200)
    industry = Label(master, text=INDUSTRY.get(user['industry']) or user['industry'])
    industry.place(anchor=u'nw', x=490, y=225)
    school = Label(master, text=user['school'] if user['school'] else u'--')
    school.place(anchor=u'nw', x=490, y=250)
    position = Label(master, text=POSITION.get(user['position']) or user['position'])
    position.place(anchor=u'nw', x=490, y=275)
    # Radio buttons: "satisfied" / "not satisfied".
    satisfy = IntVar()
    satisfy.set(int(user.get(u'satisfy', -1)))
    satisfied = Radiobutton(master, text=u"滿意", variable=satisfy,
                            value=1, command=set_satisfy)
    not_satisfied = Radiobutton(master, text=u"不滿意", variable=satisfy,
                                value=0, command=set_satisfy)
    satisfied.place(anchor=u'nw', x=450, y=10)
    not_satisfied.place(anchor=u'nw', x=510, y=10)
    # Radio buttons for the appearance score, 1 to 10.
    appearance = IntVar()
    appearance.set(int(user.get(u'appearance', -1)))
    for i in range(1, 11):
        score_i = Radiobutton(master, text=str(i), variable=appearance,
                              value=i, command=set_appearance)
        score_i.place(anchor=u'nw', x=i * 40 - 30, y=460)

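
def handle_previous():
    """Show the previous user."""
    global offset
    if offset <= 0:
        # Message: "already at the first user".
        messagebox.showwarning(u'error', u'已經是第一個')
        return
    offset -= 1
    get_user(offset)
    update()


def handle_next():
    """Show the next user."""
    global offset
    offset += 1
    get_user(offset)
    if not user:
        # Message: "already at the last user".
        messagebox.showwarning(u'error', u'已經是最後一個')
        # Step back so that `offset` and `user` still point at the last valid record.
        offset -= 1
        get_user(offset)
        return
    update()
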

def add_assembly():
    """Add the widgets."""
    init()
    # Static captions, top to bottom: house, car, age, height, salary,
    # education, company, industry, school, position.
    #buy_house_static = Label(master, text=u'購房: ', fontt=font(size=15))
    buy_house_static = Label(master, font=("15"), text=u'購房: ')
    buy_house_static.place(anchor=u'nw', x=440, y=50)
    buy_car_static = Label(master, font=("15"), text=u'購車: ')
    buy_car_static.place(anchor=u'nw', x=440, y=75)
    age_static = Label(master, font=("15"), text=u'年齡: ')
    age_static.place(anchor=u'nw', x=440, y=100)
    height_static = Label(master, font=("15"), text=u'身高: ')
    height_static.place(anchor=u'nw', x=440, y=125)
    salary_static = Label(master, font=("15"), text=u'工資: ')
    salary_static.place(anchor=u'nw', x=440, y=150)
    education_static = Label(master, font=("15"), text=u'學歷: ')
    education_static.place(anchor=u'nw', x=440, y=175)
    company_static = Label(master, font=("15"), text=u'公司: ')
    company_static.place(anchor=u'nw', x=440, y=200)
    industry_static = Label(master, font=("15"), text=u'行業: ')
    industry_static.place(anchor=u'nw', x=440, y=225)
    school_static = Label(master, font=("15"), text=u'學校: ')
    school_static.place(anchor=u'nw', x=440, y=250)
    position_static = Label(master, font=("15"), text=u'職位: ')
    position_static.place(anchor=u'nw', x=440, y=275)
    # Navigation buttons: "previous" and "next".
    previous_button = Button(master, text=u'上一個', command=handle_previous)
    previous_button.place(anchor=u'nw', x=10, y=490)
    next_button = Button(master, text=u'下一個', command=handle_next)
    next_button.place(anchor=u'nw', x=520, y=490)


if __name__ == '__main__':
    init_master()
    add_assembly()
    master.mainloop()
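mark.py repeats the same lookup in init() and update(): translate a stored code through a mapping table, fall back to the raw value, and show "--" when a field is empty. That pattern could be collected into one helper, sketched below. The name display_value and the example mapping are hypothetical; the real tables (BUY_HOUSE, EDUCATION and so on) live in extension.py and are not shown in this post.

```python
def display_value(mapping, raw, placeholder='--'):
    """Return the readable label for a raw code.

    Falls back to the raw value when the mapping has no entry, and to
    the placeholder when the raw value itself is empty.
    """
    label = mapping.get(raw) or raw
    return label if label else placeholder


# Hypothetical mapping, for illustration only.
EXAMPLE_EDUCATION = {1: 'high school', 2: 'bachelor', 3: 'master'}

print(display_value(EXAMPLE_EDUCATION, 2))  # bachelor
print(display_value(EXAMPLE_EDUCATION, 9))  # 9
print(display_value({}, ''))                # --
```

With such a helper, a line like education['text'] = display_value(EDUCATION, user['education']) would replace each repeated expression.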
I have not yet modified or debugged train.py, so the decision-tree training part has not been tried out yet.