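Today I worked on a new crawler project: scraping company information from IT桔子 (ITJuzi, www.itjuzi.com/company).
ITJuzi is a structured company database and business information provider focused on the IT and internet industry.
It aims to help its users and customers save time and money and work more efficiently by producing, aggregating, mining and processing information and data, in support of business activities such as venture capital, acquisitions, competitive intelligence, industry-segment research, and data services on foreign companies and products.
- So the information on ITJuzi has real commercial value, and there is demand for scraping it. It is also not trivial to scrape. The site restricts what ordinary (non-member) visitors can see, so the usual anti-anti-scraping measures are all needed: setting request headers such as User-Agent, Host and Cookie, rotating proxy IPs, and so on. The page HTML is rendered dynamically, so fetching it directly does not return the data; Selenium has to drive a Chrome browser, and the data is extracted only after rendering finishes. On top of that, different detail pages use different HTML layouts, so the crawler code has to be made more robust step by step during testing.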
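One of the measures listed above, varying the User-Agent header, can be sketched as a small Scrapy downloader middleware. This is only an illustration: the class name `RandomUserAgentMiddleware` and the `USER_AGENTS` pool are hypothetical and not part of the project code in this post, and the last two lines use a stand-in object rather than a real `scrapy.Request`.

```python
import random
from types import SimpleNamespace

# Hypothetical pool; a real crawler would keep a longer list of current browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36",
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware that sets a random User-Agent on every request."""

    def __init__(self, user_agents=None):
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing the request as usual.
        return None


# Stand-in for a scrapy.Request, used only to exercise the middleware here.
fake_request = SimpleNamespace(headers={})
RandomUserAgentMiddleware().process_request(fake_request, spider=None)
```

Like any downloader middleware, it would only run after being registered under DOWNLOADER_MIDDLEWARES in settings.py.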
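Steps
Define the target data on the detail page: (items.py)
a. Company profile: company_name; company_slogan; company_link; company_tags
b. Basic company information: company_info; company_full_name; create_time; company_size; company_status
c. Funding history: invest_list (a list; each element is a dict holding one funding round)
d. Team: team_list (a list; each element is a dict holding one team lead)
e. Products: product_list (a list; each element is a dict holding one product)
Work out how to crawl the site:
a. All the information is on the detail pages, so only the numeric suffix of the URL needs to change;
b. Some pages are static and some are dynamic, so use Selenium to drive Chrome;
c. The full information is only served after login, so simulate the login or send cookies with the requests;
d. The HTML structure is inconsistent between pages, so add checks to make the crawler more robust;
e. Handle other problems case by case as they come up.
Choose the framework: Scrapy. Either module works: spider or crawl_spider.
- Crawl one page of data (itjuzi.py)
- Parse the data (itjuzi.py)
- Store the data and check that the results are correct (pipelines.py, settings.py)
- Turn on the loop and start the real crawl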
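The point in the plan that only the numeric suffix of the detail URL changes can be sketched as a small generator. The function name `detail_urls` and the id range are assumptions made for illustration; the post does not say how many company ids exist.

```python
BASE_URL = "https://www.itjuzi.com/company/"


def detail_urls(first_id, last_id):
    """Yield detail-page URLs for company ids first_id..last_id (inclusive)."""
    for company_id in range(first_id, last_id + 1):
        yield BASE_URL + str(company_id)


# Hypothetical range, just to show the shape of the URLs.
urls = list(detail_urls(1, 3))
```

A bounded range like this also gives the crawl a natural stopping point, which a counter that is incremented forever does not.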
Crawling with the spider module
items.py
import scrapy


class JuziItem(scrapy.Item):
    # 1. Company profile
    company_name = scrapy.Field()
    company_slogan = scrapy.Field()
    company_link = scrapy.Field()
    company_tags = scrapy.Field()
    # 2. Basic company information
    company_info = scrapy.Field()
    company_full_name = scrapy.Field()
    create_time = scrapy.Field()
    company_size = scrapy.Field()
    company_status = scrapy.Field()
    # 3. Funding
    invest_list = scrapy.Field()
    # 4. Team
    team_list = scrapy.Field()
    # 5. Products
    product_list = scrapy.Field()
    url_link = scrapy.Field()
    # Data source
    data_source = scrapy.Field()
    data_time = scrapy.Field()
itjuzi.py
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from ITJuzi.items import JuziItem


class JuziSpider(scrapy.Spider):
    name = 'itjuzi'
    allowed_domains = ['itjuzi.com']
    base_url = 'https://www.itjuzi.com/company/'
    offset = 1
    start_urls = [base_url + str(offset)]
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "gr_user_id=8b2a0647-ed6e-4da9-bd79-0927840738ba; _ga=GA1.2.1065816449.1520818726; MEIQIA_EXTRA_TRACK_ID=11oNlg9W4BPRdVJbbc5Mg9covSB; _gid=GA1.2.1051909062.1524629235; acw_tc=AQAAADMxgTrydgkAxrxRZa/yV6lXP/Tv; Hm_lvt_1c587ad486cdb6b962e94fc2002edf89=1524629235,1524637618,1524702648; gr_session_id_eee5a46c52000d401f969f4535bdaa78=5ac2fdfd-b747-46e3-84a3-573d49e8f0f0_true; identity=1019197976%40qq.com; remember_code=N8cv8vX9xK; unique_token=498323; acw_sc__=5ae1302fee977bcf1d5f28b7fe96b94d7b5de97c; session=e12ae81c38e8383dcaeaaff9ded967758bc5a01c; Hm_lpvt_1c587ad486cdb6b962e94fc2002edf89=1524707391",
        "Host": "www.itjuzi.com",
        "If-Modified-Since": "Thu, 26 Apr 2018 01:49:47 GMT",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36",
    }

    # Send the logged-in cookie (and the other headers) with the first request
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=self.headers)

    def parse(self, response):
        # Parse the page
        soup = BeautifulSoup(response.body, 'lxml')
        item = JuziItem()
        item['url_link'] = response.url
        # 1. Company profile
        # cpy1 = soup.find(attrs={'class':"infoheadrow-v2"})
        cpy1 = soup.find(class_='infoheadrow-v2')
        if cpy1:
            item['company_name'] = cpy1.select('.seo-important-title')[0].get('data-name')
            item['company_slogan'] = cpy1.select('.seo-slogan')[0].get_text()
            item['company_link'] = cpy1.select('.link-line a')[-1].get_text().strip()
            # Join all tag texts into one space-separated string
            tag_list = cpy1.select('.tag-list a')
            item['company_tags'] = " ".join(tag.get_text().strip() for tag in tag_list)
        # 2. Basic company information
        cpy2 = soup.find(class_='block-inc-info')
        if cpy2:
            item['company_info'] = cpy2.select('.block div')[-1].get_text().strip()
            item['company_full_name'] = cpy2.select('.des-more h2')[0].get_text().strip()
            item['create_time'] = cpy2.select('.des-more h3')[0].get_text().strip()
            item['company_size'] = cpy2.select('.des-more h3')[1].get_text().strip()
            item['company_status'] = cpy2.select('.pull-right')[0].get_text().strip()
        # 3. Funding
        cpy3 = soup.find(attrs={'id': "invest-portfolio"})
        if cpy3:
            inv_list = []
            for tr in cpy3.select('tr'):
                tds = tr.select('td')
                # Four cells are read below, so skip rows that have fewer
                if len(tds) > 3:
                    tr_dict = {}
                    tr_dict['time'] = tds[0].get_text().strip()
                    tr_dict['round'] = tds[1].get_text().strip()
                    tr_dict['money'] = tds[2].get_text().strip()
                    tr_dict['name'] = tds[3].get_text().strip()
                    inv_list.append(tr_dict)
            item['invest_list'] = inv_list
        # 4. Team
        # select_one returns None when the block is missing, instead of raising IndexError
        cpy4 = soup.select_one('.team-list')
        if cpy4:
            tea_list = cpy4.select('li')
            team_temp_list = []
            for tr in tea_list:
                tr_dict = {}
                tr_dict['name'] = tr.select('.per-name')[0].get_text().strip()
                tr_dict['position'] = tr.select('.per-position')[0].get_text().strip()
                tr_dict['info'] = tr.select('.per-des')[0].get_text().strip()
                team_temp_list.append(tr_dict)
            item['team_list'] = team_temp_list
        # 5. Products
        # select_one returns None when the block is missing, instead of raising IndexError
        cpy5 = soup.select_one('.product-list')
        if cpy5:
            li_list = cpy5.select('li')
            pro_temp_list = []
            for tr in li_list:
                tr_dict = {}
                tr_dict['name'] = tr.select('.product-name')[0].get_text().strip()
                tr_dict['info'] = tr.select('.product-des')[0].get_text().strip()
                pro_temp_list.append(tr_dict)
            item['product_list'] = pro_temp_list
        # Hand the parsed item to the engine, which passes it on to the pipelines
        yield item
        # Request the next detail page, sending the same headers as the first request
        self.offset += 1
        url = self.base_url + str(self.offset)
        yield scrapy.Request(url=url, callback=self.parse, headers=self.headers)
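Because the detail pages do not all share the same layout, every `select(...)[0]` in the parser above can raise IndexError when an element is missing. One way to harden this is a small helper that falls back to a default value. The name `first_text` is hypothetical; `node` is expected to be a BeautifulSoup tag (anything that offers `select_one`).

```python
def first_text(node, selector, default=""):
    """Return the stripped text of the first match of `selector` under `node`.

    Falls back to `default` when `node` is None or nothing matches, so a
    missing element on one page does not abort parsing of the whole item.
    """
    if node is None:
        return default
    found = node.select_one(selector)
    if found is None:
        return default
    return found.get_text().strip()
```

In the team loop, for instance, a line could then read `tr_dict['name'] = first_text(tr, '.per-name')`.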
- Note: the data is stored in Redis here, using the pipeline built into scrapy-redis
pipelines.py
from datetime import datetime


class JuziPipeline(object):
    def process_item(self, item, spider):
        item['data_source'] = spider.name
        item['data_time'] = datetime.utcnow()
        return item
middlewares.py
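import time

from scrapy.http import HtmlResponse
from selenium import webdriver


class ChromeMiddleware(object):
    def process_request(self, request, spider):
        # Load the page in a real Chrome instance
        driver = webdriver.Chrome()
        driver.get(request.url)
        # Fixed wait so the JavaScript has time to render the page
        time.sleep(5)
        data = driver.page_source
        driver.quit()
        # Returning a response here bypasses Scrapy's own downloader
        return HtmlResponse(url=request.url, body=data.encode('utf-8'), encoding='utf-8', request=request)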
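Scrapy only calls a downloader middleware that is registered in settings.py, and the settings file in this post has no such entry. A possible fragment is sketched below. The module path `ITJuzi.middlewares` is an assumption that follows the `ITJuzi.pipelines.JuziPipeline` naming used in this project, and 543 is simply the priority used in Scrapy's project template.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Assumed module path; adjust it to wherever ChromeMiddleware actually lives.
    'ITJuzi.middlewares.ChromeMiddleware': 543,
}
```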
settings.py
BOT_NAME = 'ITJuzi'
SPIDER_MODULES = ['ITJuzi.spiders']
NEWSPIDER_MODULE = 'ITJuzi.spiders'
# 1. Use the distributed duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 2. Use the distributed scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# 3. Keep the request queue so the crawl can be stopped and resumed
SCHEDULER_PERSIST = True
ITEM_PIPELINES = {
    'ITJuzi.pipelines.JuziPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400
}
# 4. Redis host and port
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
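With these settings the items end up in a Redis list. As far as I know, scrapy-redis names that list with the template `'%(spider)s:items'` by default and stores each item as a JSON string. The sketch below builds the key name and decodes one stored value; the helper names `items_key` and `decode_item` are hypothetical, and the default key template should be checked against the installed scrapy-redis version.

```python
import json

# Assumed scrapy-redis default for the REDIS_ITEMS_KEY setting.
ITEMS_KEY_TEMPLATE = '%(spider)s:items'


def items_key(spider_name):
    """Name of the Redis list that holds the items of the given spider."""
    return ITEMS_KEY_TEMPLATE % {'spider': spider_name}


def decode_item(raw):
    """Turn one value popped from the items list back into a dict."""
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8')
    return json.loads(raw)
```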
Crawling with the crawl_spider module
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from ITJuzi.items import JuziItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class JuziSpider(CrawlSpider):
    name = 'juzi_crawl'
    allowed_domains = ['itjuzi.com']
    start_urls = [
        # 1. Domestic startups
        'https://www.itjuzi.com/company',
        # 2. Domestic listed companies
        'https://www.itjuzi.com/company/listed',
        # 3. Foreign startups
        'https://www.itjuzi.com/company/foreign',
        # 4. Foreign listed companies
        'https://www.itjuzi.com/foreign/listed'
    ]
    # Crawl rules (raw strings keep the regex escapes intact)
    rules = (
        # 1. Domestic startups, list pages. No callback, so follow defaults to True
        Rule(LinkExtractor(allow=r'company\?page=')),
        # 2. Domestic listed companies, list pages
        Rule(LinkExtractor(allow=r'company/listed\?page=')),
        # 3. Foreign startups, list pages
        Rule(LinkExtractor(allow=r'company/foreign\?page=')),
        # 4. Foreign listed companies, list pages
        Rule(LinkExtractor(allow=r'company/foreign/listed\?page=')),
        # 5. Detail pages
        Rule(LinkExtractor(allow=r'company/\d+'), callback="parse_detail", follow=False),
    )
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "gr_user_id=8b2a0647-ed6e-4da9-bd79-0927840738ba; _ga=GA1.2.1065816449.1520818726; MEIQIA_EXTRA_TRACK_ID=11oNlg9W4BPRdVJbbc5Mg9covSB; _gid=GA1.2.1051909062.1524629235; acw_tc=AQAAADMxgTrydgkAxrxRZa/yV6lXP/Tv; Hm_lvt_1c587ad486cdb6b962e94fc2002edf89=1524629235,1524637618,1524702648; gr_session_id_eee5a46c52000d401f969f4535bdaa78=5ac2fdfd-b747-46e3-84a3-573d49e8f0f0_true; identity=1019197976%40qq.com; remember_code=N8cv8vX9xK; unique_token=498323; acw_sc__=5ae1302fee977bcf1d5f28b7fe96b94d7b5de97c; session=e12ae81c38e8383dcaeaaff9ded967758bc5a01c; Hm_lpvt_1c587ad486cdb6b962e94fc2002edf89=1524707391",
        "Host": "www.itjuzi.com",
        "If-Modified-Since": "Thu, 26 Apr 2018 01:49:47 GMT",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36",
    }
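
    # Send the logged-in cookie (and the other headers) with the start requests
    def start_requests(self):
        for url in self.start_urls:
            # No explicit callback: the response must go through CrawlSpider's own
            # rule handling, which newer Scrapy releases no longer expose as parse()
            yield scrapy.Request(url=url, headers=self.headers)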

    def parse_detail(self, response):
        # Parse the page
        soup = BeautifulSoup(response.body, 'lxml')
        item = JuziItem()
        item['url_link'] = response.url
        # 1. Company profile
        # cpy1 = soup.find(attrs={'class':"infoheadrow-v2"})
        cpy1 = soup.find(class_='infoheadrow-v2')
        if cpy1:
            item['company_name'] = cpy1.select('.seo-important-title')[0].get('data-name')
            item['company_slogan'] = cpy1.select('.seo-slogan')[0].get_text()
            item['company_link'] = cpy1.select('.link-line a')[-1].get_text().strip()
            # Join all tag texts into one space-separated string
            tag_list = cpy1.select('.tag-list a')
            item['company_tags'] = " ".join(tag.get_text().strip() for tag in tag_list)
        # 2. Basic company information
        cpy2 = soup.find(class_='block-inc-info')
        if cpy2:
            item['company_info'] = cpy2.select('.block div')[-1].get_text().strip()
            item['company_full_name'] = cpy2.select('.des-more h2')[0].get_text().strip()
            item['create_time'] = cpy2.select('.des-more h3')[0].get_text().strip()
            item['company_size'] = cpy2.select('.des-more h3')[1].get_text().strip()
            item['company_status'] = cpy2.select('.pull-right')[0].get_text().strip()
        # 3. Funding
        cpy3 = soup.find(attrs={'id': "invest-portfolio"})
        if cpy3:
            inv_list = []
            for tr in cpy3.select('tr'):
                tds = tr.select('td')
                # Four cells are read below, so skip rows that have fewer
                if len(tds) > 3:
                    tr_dict = {}
                    tr_dict['time'] = tds[0].get_text().strip()
                    tr_dict['round'] = tds[1].get_text().strip()
                    tr_dict['money'] = tds[2].get_text().strip()
                    tr_dict['name'] = tds[3].get_text().strip()
                    inv_list.append(tr_dict)
            item['invest_list'] = inv_list
        # 4. Team
        # select_one returns None when the block is missing, instead of raising IndexError
        cpy4 = soup.select_one('.team-list')
        if cpy4:
            tea_list = cpy4.select('li')
            team_temp_list = []
            for tr in tea_list:
                tr_dict = {}
                tr_dict['name'] = tr.select('.per-name')[0].get_text().strip()
                tr_dict['position'] = tr.select('.per-position')[0].get_text().strip()
                tr_dict['info'] = tr.select('.per-des')[0].get_text().strip()
                team_temp_list.append(tr_dict)
            item['team_list'] = team_temp_list
        # 5. Products
        # select_one returns None when the block is missing, instead of raising IndexError
        cpy5 = soup.select_one('.product-list')
        if cpy5:
            li_list = cpy5.select('li')
            pro_temp_list = []
            for tr in li_list:
                tr_dict = {}
                tr_dict['name'] = tr.select('.product-name')[0].get_text().strip()
                tr_dict['info'] = tr.select('.product-des')[0].get_text().strip()
                pro_temp_list.append(tr_dict)
            item['product_list'] = pro_temp_list
        # Hand the parsed item to the engine, which passes it on to the pipelines
        yield item
- The other modules are the same as in the spider version
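To see which links the rules in the crawl_spider version would pick up, the same regular expressions can be tried against sample URLs with `re.search`, which is how a LinkExtractor `allow` pattern is applied to a URL as far as I know. The `classify` function and the sample URLs are hypothetical, and checking list patterns before the detail pattern only approximates the rule order.

```python
import re

# The same patterns as in the rules tuple of the CrawlSpider.
LIST_PATTERNS = [
    r'company\?page=',
    r'company/listed\?page=',
    r'company/foreign\?page=',
    r'company/foreign/listed\?page=',
]
DETAIL_PATTERN = r'company/\d+'


def classify(url):
    """Return 'list', 'detail' or None depending on which pattern matches first."""
    if any(re.search(pattern, url) for pattern in LIST_PATTERNS):
        return 'list'
    if re.search(DETAIL_PATTERN, url):
        return 'detail'
    return None
```

List pages are only followed to discover more links, while detail pages go to parse_detail and are not followed further.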
Distributed crawling with scrapy-redis
Based on the spider module
Steps
- Import the distributed spider class: RedisSpider
- Change the JuziSpider class so that it inherits from RedisSpider
- Set redis_key
- Enable the distributed duplicate filter in settings.py:
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
- Enable the distributed scheduler in settings.py:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
- Enable resumable crawling in settings.py:
SCHEDULER_PERSIST = True
- Enable the Redis pipeline in settings.py (built in, no custom code needed):
'scrapy_redis.pipelines.RedisPipeline': 400,
itjuzi_redis.py
...
from scrapy_redis.spiders import RedisSpider


class JuziSpider(RedisSpider):
    name = 'juzi_redis'
    allowed_domains = ['itjuzi.com']
    redis_key = 'juzikey'
...
settings.py
...
# 1. Enable the distributed duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# 2. Enable the distributed scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# 3. Resume after interruption: if the crawl stops after 1000 requests, the next run continues from number 1001
SCHEDULER_PERSIST = True
# 4. The Redis pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# Redis host and port
REDIS_HOST = '192.168.90.169'
REDIS_PORT = 6379
...
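A RedisSpider has no start_urls of its own; it waits until URLs appear in the Redis list named by redis_key. A minimal sketch of seeding that list follows. The helper name `seed_start_urls` is hypothetical, and `client` is assumed to be a redis-py client or any object with a compatible `lpush(key, value)` method.

```python
def seed_start_urls(client, key, urls):
    """Push each URL onto the Redis list `key`; return how many were pushed."""
    count = 0
    for url in urls:
        client.lpush(key, url)
        count += 1
    return count
```

With redis-py this might be called as `seed_start_urls(redis.Redis(host='192.168.90.169', port=6379), 'juzikey', ['https://www.itjuzi.com/company/1'])`. The same thing can be done by hand with `lpush juzikey <url>` in redis-cli.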
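Additional notes:
- When a distributed crawler uses scrapy-redis, Redis keeps everything in memory, which makes reads and writes very fast. But when there is too much data to crawl, it cannot all stay in Redis, and the scraped data has to be moved elsewhere, for example into MongoDB or MySQL:
Moving the data from Redis into MongoDB
- Start MongoDB:
sudo mongod
- Run the following program: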
# process_aqi_mongodb.py
# -*- coding: utf-8 -*-
import json
import redis
import pymongo


def main():
    # Redis connection
    rediscli = redis.Redis(host='192.168.88.94', port=6379, db=0)
    # MongoDB connection
    mongocli = pymongo.MongoClient(host='127.0.0.1', port=27017)
    # Database name
    db = mongocli['aqi']
    # Collection name
    sheet = db['aqi_data']
    while True:
        # blpop gives FIFO order, brpop gives LIFO order; both return (key, value)
        source, data = rediscli.blpop(["aqi:items"])
        item = json.loads(data)
        # insert_one replaces Collection.insert, which recent PyMongo releases no longer provide
        sheet.insert_one(item)
        try:
            print("Processing: %(name)s <%(link)s>" % item)
        except KeyError:
            print("Error processing: %r" % item)


if __name__ == '__main__':
    main()
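Moving the data from Redis into MySQL
- Start the database (the command differs by platform):
mysql-server start
- Log in as root:
mysql -uroot -p
- Create a database, e.g. aqi:
create database aqi;
- Switch to that database:
use aqi
- Create the table aqi_data with a column name and data type for every field
- Run the following program: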
# process_aqi_mysql.py
# -*- coding: utf-8 -*-
import json
import redis
import MySQLdb


def main():
    # Redis connection
    rediscli = redis.StrictRedis(host='192.168.88.94', port=6379, db=0)
    # MySQL connection
    mysqlcli = MySQLdb.connect(host='127.0.0.1', user='root', passwd='xxxxxxx', db='aqi', port=3306, use_unicode=True)
    while True:
        # blpop gives FIFO order, brpop gives LIFO order; both return (key, value)
        source, data = rediscli.blpop(["aqi:items"])
        item = json.loads(data)
        try:
            # Get a cursor for this operation
            cur = mysqlcli.cursor()
            # Run the INSERT; `rank` is quoted because it is a reserved word in MySQL 8
            cur.execute("INSERT INTO aqi_data (city, date, aqi, level, pm2_5, pm10, so2, co, no2, o3, `rank`, spider, crawled) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)", [item['city'], item['date'], item['aqi'], item['level'], item['pm2_5'], item['pm10'], item['so2'], item['co'], item['no2'], item['o3'], item['rank'], item['spider'], item['crawled']])
            # Commit the transaction
            mysqlcli.commit()
            # Close the cursor
            cur.close()
        except MySQLdb.Error as e:
            print("Mysql Error %d: %s" % (e.args[0], e.args[1]))


if __name__ == '__main__':
    main()
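The table-creation step above does not spell out a schema. One possible definition is sketched here: the column names are taken from the INSERT statement in the script, but every column type is an assumption, and the helper name `create_table` is hypothetical.

```python
# Column names come from the INSERT statement; all types are assumed.
CREATE_AQI_DATA = """
CREATE TABLE IF NOT EXISTS aqi_data (
    city VARCHAR(32),
    date VARCHAR(32),
    aqi INT,
    level VARCHAR(32),
    pm2_5 FLOAT,
    pm10 FLOAT,
    so2 FLOAT,
    co FLOAT,
    no2 FLOAT,
    o3 FLOAT,
    `rank` INT,
    spider VARCHAR(32),
    crawled VARCHAR(32)
)
"""


def create_table(connection):
    """Run the DDL on a DB-API connection, e.g. one returned by MySQLdb.connect."""
    cur = connection.cursor()
    cur.execute(CREATE_AQI_DATA)
    connection.commit()
    cur.close()
```

The backticks around `rank` are there because newer MySQL versions treat it as a reserved word.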