Crawler component of the telecom group government/enterprise project
1. Technologies used in this project: scrapy, scrapyd, scrapyd-client, docker, docker-compose.
2. The requirement is to crawl tender and bidding announcements from every provincial-level site and provincial capital nationwide. I had previously crawled Zhejiang provincial tenders with a single local Scrapy run. This time there are many more target sites, the crawls must run on a schedule, and the architecture needs to be reusable and extensible. So, on top of Scrapy, I use the features of Docker and scrapyd to implement it.
3. Implementation: each project member has their own scrapyd deployment endpoint. Spiders are written locally, then packaged and deployed via scrapyd-client. Since the project currently has only one host, I use docker-compose on it to start several scrapyd containers, each distinguished by a different port number; each scrapyd instance is reached externally via the host IP plus that instance's port.
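The scrapyd-client deployment step works off deploy targets defined in the Scrapy project's scrapy.cfg. A minimal sketch, assuming a project named `tender` (the project name, target names, and host IP are placeholders, not from the original setup):

```ini
# scrapy.cfg — one [deploy:...] target per scrapyd container
[settings]
default = tender.settings

[deploy:node6800]
url = http://<host-ip>:6800/
project = tender

[deploy:node6801]
url = http://<host-ip>:6801/
project = tender
```

Running `scrapyd-deploy node6800 -p tender` then packages the project as an egg and uploads it to that instance; `scrapyd-deploy -a -p tender` pushes to every target at once.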
The directory layout of the whole setup:
drwxr-x--- 2 puaiuc users 57 Mar 10 14:17 6800
drwxr-x--- 2 puaiuc users 57 Mar 10 14:33 6801
drwxr-x--- 2 puaiuc users 57 Mar 10 14:34 6802
drwxr-x--- 2 puaiuc users 57 Mar 10 14:35 6803
drwxr-x--- 2 puaiuc users 57 Mar 10 14:35 6804
drwxr-x--- 2 puaiuc users 23 Mar 10 15:46 base_scrapy
-rw-r----- 1 puaiuc users 1062 Mar 10 14:38 docker-compose.yml
-rw-r----- 1 puaiuc users 413 Mar 10 14:11 docker-compose.yml.bak1
The Dockerfile for base_scrapy is as follows:
FROM python:3.6
MAINTAINER [email protected]
RUN apt-get update && apt-get upgrade -y \
    && pip install scrapy scrapyd pymongo mysqlclient
Build the scrapyd base image and push it to Docker Hub.
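The build-and-push step looks like this (the image name `yyqq188/base_scrapy` is taken from the per-node Dockerfiles below; a `docker login` session against Docker Hub is assumed):

```shell
# build the base image from the base_scrapy directory
cd base_scrapy
docker build -t yyqq188/base_scrapy .

# authenticate and publish to Docker Hub
docker login
docker push yyqq188/base_scrapy
```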
Directories 6800–6804 each correspond to one scrapyd instance, one port per instance. The structure of the 6800 directory is shown below; the other directories are identical:
-rw-r----- 1 puaiuc users 134 Mar 10 14:17 Dockerfile
-rw-r----- 1 puaiuc users 897 Mar 10 12:35 scrapyd.conf
-rwxr-x--x 1 puaiuc users 32 Mar 10 13:35 start.sh
Taking 6800 as the example, its Dockerfile is:
FROM yyqq188/base_scrapy
MAINTAINER [email protected]
COPY scrapyd.conf /etc/scrapyd/scrapyd.conf
COPY start.sh /start.sh
scrapyd.conf is shown below. The instances share the same file except for http_port, which matches the directory name; the entries to focus on are bind_address and http_port:
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
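With bind_address set to 0.0.0.0, each instance's JSON web API can be driven over HTTP from outside the container, using the endpoints registered in the [services] section above. A sketch (host IP, project, and spider names are placeholders):

```shell
# check that the daemon is up
curl http://<host-ip>:6800/daemonstatus.json

# schedule a crawl of a previously deployed spider
curl http://<host-ip>:6800/schedule.json -d project=tender -d spider=zhejiang

# list pending/running/finished jobs for the project
curl "http://<host-ip>:6800/listjobs.json?project=tender"
```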
start.sh contains:
#!/bin/bash
# run scrapyd in the foreground so the container stays up;
# exec lets it receive signals (e.g. docker stop) directly
exec scrapyd > /dev/null
docker-compose.yml contains:
version: "3"
services:
  scrapyd-6800:
    build: ./6800
    ports:
      - "6800:6800"
    links:
      - mysql-docker
    command: bash /start.sh
  scrapyd-6801:
    build: ./6801
    ports:
      - "6801:6801"
    links:
      - mysql-docker
    command: bash /start.sh
  scrapyd-6802:
    build: ./6802
    ports:
      - "6802:6802"
    links:
      - mysql-docker
    command: bash /start.sh
  scrapyd-6803:
    build: ./6803
    ports:
      - "6803:6803"
    links:
      - mysql-docker
    command: bash /start.sh
  scrapyd-6804:
    build: ./6804
    ports:
      - "6804:6804"
    links:
      - mysql-docker
    command: bash /start.sh
  mysql-docker:
    image: "mysql:5.6"
    environment:
      MYSQL_ROOT_PASSWORD: abc
    volumes:
      - /data/mysql_data:/var/lib/mysql
    ports:
      - "3306:3306"
To add more containers later, just copy a node directory like 6800, change http_port in its scrapyd.conf, and add a matching service entry to docker-compose.yml.
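That copy-and-edit step can be scripted. A sketch for adding a hypothetical 6805 node (it fabricates a minimal 6800 directory first so the example is self-contained; in the real tree you would start from the existing one):

```shell
#!/bin/bash
# clone the 6800 node directory into a new 6805 node,
# rewriting http_port in scrapyd.conf to match the new port
set -e

# stand-in for the existing 6800 directory (already present in the real tree)
mkdir -p 6800
printf '[scrapyd]\nbind_address = 0.0.0.0\nhttp_port = 6800\n' > 6800/scrapyd.conf

NEW=6805
cp -r 6800 "$NEW"
# point the copied config at the new port
sed -i "s/^http_port = 6800/http_port = $NEW/" "$NEW/scrapyd.conf"
grep http_port "$NEW/scrapyd.conf"   # http_port = 6805
```

After that, a `scrapyd-6805` service with `build: ./6805` and a `"6805:6805"` port mapping goes into docker-compose.yml.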
One detail to note: the file name scrapyd.conf in /etc/scrapyd/scrapyd.conf must not be changed. scrapyd reads its configuration from a fixed set of file names at startup, so if you rename the file it will still look for the default name and silently fall back to default settings. Keep the name and change only the contents.
A later step is to evolve this deployment to Docker Swarm and Kubernetes.