1. scrapy, a data-crawling framework
Install the Anaconda environment
Install the dependency libraries
Create the project
Write the implementation code
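The project-creation steps above can be sketched with the scrapy CLI. This is a minimal sequence, assuming the Anaconda environment from the previous step is active; the project name `spider_car`, spider name `car`, and target domain are placeholders, not taken from the original:

```shell
# Install scrapy into the Anaconda environment
conda install -y scrapy

# Scaffold a new project and generate a spider skeleton
# ("spider_car", "car", and the domain are placeholder names)
scrapy startproject spider_car
cd spider_car
scrapy genspider car example.com

# Run the spider locally to verify the setup
scrapy crawl car
```

`scrapy genspider` only creates a skeleton; the parsing logic still has to be written in the generated spider module.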
2. Distributed crawling with scrapy-redis
Install the dependencies
Modify the configuration file
Open multiple terminals to simulate distributed crawling
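The configuration change above amounts to a few additions to the project's `settings.py`. A minimal sketch, assuming a local Redis instance (the Redis URL is a placeholder, not from the original):

```python
# settings.py additions for scrapy-redis

# Use the scrapy-redis scheduler so requests are queued in Redis
# and shared by every worker
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate request fingerprints across all workers via Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dupefilter in Redis between runs, so a crawl
# can be paused and resumed
SCHEDULER_PERSIST = True

# Redis connection shared by every crawler instance (placeholder URL)
REDIS_URL = "redis://127.0.0.1:6379"
```

With these settings, a spider can inherit from `scrapy_redis.spiders.RedisSpider` and wait for start URLs on its Redis key; pushing a URL with `redis-cli lpush` then wakes every terminal running `scrapy crawl`, which simulates the distributed setup described above.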
3. Write the Dockerfile
```dockerfile
# Set base OS
FROM centos:7.6.1810

# Set maintainer
MAINTAINER xxx <[email protected]>

# Install basic dependencies
RUN yum install -y initscripts crontabs wget bzip2

# Download and install Anaconda Python, then remove the installer
RUN wget --quiet https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh -O ~/anaconda.sh && \
    /bin/bash ~/anaconda.sh -b -p /opt/conda && \
    rm ~/anaconda.sh

# Add conda to PATH (single quotes so $PATH expands at login, not at build time)
RUN echo 'export PATH=/opt/conda/bin:$PATH' >> ~/.bashrc

# Install the crawler dependencies
RUN /opt/conda/bin/conda install -y scrapy
RUN /opt/conda/bin/pip install scrapy-redis
RUN /opt/conda/bin/conda install -y pymongo

# Set timezone
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

# Set crontab task (destination must be an absolute path)
COPY crontab /var/spool/cron/root

# Set locale
ENV LANG C.UTF-8 LC_ALL=C.UTF-8
```
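The Dockerfile copies a local `crontab` file into the container to schedule the crawl. A minimal example of what that file might contain — the schedule, paths, log file, and spider name are all assumptions, not from the original:

```
# Run the spider every day at 02:00 (placeholder paths and spider name)
0 2 * * * cd /opt/spider_py/spider_car && /opt/conda/bin/scrapy crawl car >> /var/log/spider_car.log 2>&1
```

Because the container is started with `/usr/sbin/init`, the crond service from `initscripts`/`crontabs` runs inside the container and picks up this schedule.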
4. Build the container (no version management yet)
```shell
#!/bin/bash
# Stop the running container, if any
result=$(docker ps | grep "spider_car")
if [[ "$result" != "" ]]; then
    echo "stop spider_car"
    docker stop spider_car
fi

# Remove the stopped container, if any
result1=$(docker ps -a | grep "spider_car")
if [[ "$result1" != "" ]]; then
    echo "rm spider_car"
    docker rm spider_car
fi

# Remove the old image, if any
result2=$(docker images | grep "spider_car")
if [[ "$result2" != "" ]]; then
    echo "rmi spider_car"
    docker rmi spider_car
fi

docker build -t spider_car .

# Bind-mount the source directory into the container so the image
# does not have to be rebuilt for every code change
docker run -dit --privileged --name spider_car \
    -v /root/spider/spider_py:/opt/spider_py spider_car /usr/sbin/init
```
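Step 4 notes that the build has no version management. One simple way to add it is to tag each image with a build timestamp while keeping a `latest` alias, so old images can be kept around for rollback. This is only a sketch; the tag format is an arbitrary choice, not from the original:

```shell
#!/bin/bash
# Tag each build with a timestamp, e.g. spider_car:20190601-120000
TAG="spider_car:$(date +%Y%m%d-%H%M%S)"

docker build -t "$TAG" .
docker tag "$TAG" spider_car:latest

# Run from the freshly tagged image
docker run -dit --privileged --name spider_car \
    -v /root/spider/spider_py:/opt/spider_py "$TAG" /usr/sbin/init
```

With this scheme the cleanup script only needs to remove the container, not the image, and any previous tag can be re-run directly.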
5. Automated builds with Jenkins
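The Jenkins step can be expressed as a declarative Jenkinsfile that checks out the spider source and re-runs the build script from step 4. This is a sketch only: the stage names, repository URL, and build-script path are assumptions, not from the original:

```groovy
pipeline {
    agent any
    stages {
        stage('Checkout') {
            steps {
                // Pull the spider source; the repository URL is a placeholder
                git 'https://example.com/your/spider_car.git'
            }
        }
        stage('Build & Deploy') {
            steps {
                // Re-run the container build script from step 4
                sh './build.sh'
            }
        }
    }
}
```

Hooking this pipeline to a webhook or a poll-SCM trigger makes the image rebuild automatically on every push, which is the point of the automated-build step.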