Using the Pyspider Framework

Original

2020-06-29 05:08

Pyspider

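Pyspider is a powerful open-source web crawler system written by a Chinese developer.

Crawlers are controlled by Python scripts, and you can use any HTML parsing package you like (pyquery is built in). The web interface lets you write and debug scripts, start and stop them, monitor execution status, view activity history, and retrieve results. Data storage supports MySQL, MongoDB, Redis, SQLite, Elasticsearch, PostgreSQL, and SQLAlchemy. Message queues support RabbitMQ, Beanstalk, Redis, and Kombu. Pages rendered with JavaScript can be crawled as well. Components are replaceable, and single-machine, distributed, and Docker deployments are all supported. Scheduling control is strong, with re-crawling after a timeout and priority settings. Both Python 2 and Python 3 are supported.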

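Installing pyspider on CentOS 7

Pyspider depends on pycurl, which is a somewhat unusual package and needs to be installed from source. A related guide: http://www.linuxidc.com/Linux/2015-09/122805.htm

(The guide is written for Ubuntu; on CentOS, replace apt-get with yum.)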

Then install pyspider directly:

    pip install pyspider

You also need to install PhantomJS: http://phantomjs.org/

PhantomJS is a headless browser, that is, a browser without a graphical interface.
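Before starting pyspider, it can be worth confirming that the phantomjs executable is discoverable. The following is a minimal sketch, assuming Python 3.3 or later and that phantomjs is expected to be on PATH; the helper name `find_executable` is made up for this illustration and is not part of pyspider.

```python
import shutil


def find_executable(name="phantomjs"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    # shutil.which searches the directories listed in the PATH variable.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path:
        print("phantomjs found at", path)
    else:
        print("phantomjs not found on PATH")
```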

Starting pyspider on CentOS 7

[root@lol ~]# pyspider all
phantomjs fetcher running on port 25555
[I 171102 16:14:51 result_worker:49] result_worker starting...
[I 171102 16:14:51 processor:211] processor starting...
[I 171102 16:14:51 scheduler:647] scheduler starting...
[I 171102 16:14:52 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
[I 171102 16:14:52 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 171102 16:14:52 tornado_fetcher:638] fetcher starting...
[I 171102 16:14:52 app:76] webui running on 0.0.0.0:5000

Open port 5000 of the server in a browser.
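If the page does not load, a quick way to tell whether the web UI is listening at all is to try a plain TCP connection. This is only a rough sketch: the function name `is_port_open` is hypothetical, and the host and port are taken from the startup log above (webui on port 5000), so adjust them if your setup differs.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 5000 is the webui port reported in the startup log.
    print("webui reachable:", is_port_open("127.0.0.1", 5000))
```

A False result from the server itself suggests pyspider is not running; a True result there combined with a failure from another machine more likely points to a firewall rule.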

Next, create a project.

This takes you to the page where you debug and run the code.

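Next comes a demonstration of scraping content from the TripAdvisor travel site, https://www.tripadvisor.cn/

Suppose we want to scrape tourist attraction information from this site. First, fill in the URL of the page to be scraped.

Modify the on_start method of the Handler class: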

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('__START_URL__', callback=self.index_page)

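Replace __START_URL__ with the page you want to scrape. Click Save in the upper-right corner, then click run on the left.

When the run finishes, it produces a URL. Click web to see a preview of that page.

Then click the arrow next to the URL, and the href of every a tag on the page is extracted.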
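To make this link-extraction step concrete, here is a rough stand-in written with the standard library's html.parser rather than pyquery, which is what pyspider itself uses. The class name `LinkCollector`, the function `extract_links`, and the sample HTML are all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute ones.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in `html`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    # Made-up markup; the real page structure will differ.
    sample = '<a href="/Attractions">Things to do</a> <a name="top">no href</a>'
    print(extract_links(sample, "https://www.tripadvisor.cn/"))
```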

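That's right: with a small change, you can pull out just the attraction detail pages linked from that page.

Modify the Handler class again, this time the index_page method.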

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('.attraction_element .listing_title > a').items():
            self.crawl(each.attr.href, callback=self.detail_page)

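By using features that are specific to the detail-page links, the selector picks out only the pages you want.

The next step is to process the content of the pages that were fetched.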

    @config(priority=2)
    def detail_page(self, response):
        # response.doc is a pyquery object; each selector targets one field.
        name = response.doc(".heading_title").text()
        rating = response.doc(".reviews_header_count").text()
        address = response.doc(".location > .address").text()
        phone = response.doc(".phone > div").text()
        introduction = response.doc(".attraction_details > div").text()

        # The returned dict is the result record for this page.
        return {
            "name": name,
            "rating": rating,
            "address": address,
            "phone": phone,
            "introduction": introduction,
            "url": response.url,
            "title": response.doc('title').text(),
        }

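This uses pyquery to extract the relevant content from the page.

Finally, there are a few other operations, such as saving the results to a database or to a file.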

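    def on_result(self, result):
        # Called with whatever a callback returns; skip empty results.
        if result:
            self.saveToFile(result)

    def saveToFile(self, result):
        # Append one result per line; "with" closes the file even on error.
        with open("/root/result.txt", "a") as f:
            f.write(str(result) + "\n")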

With that, the results are saved.
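For the database route mentioned earlier, one possible approach is to write each result into SQLite with Python's built-in sqlite3 module. This is a sketch under assumptions, not pyspider's own result storage: the function name `save_to_sqlite`, the table name `results`, and its two-column layout are hypothetical choices made for this example.

```python
import json
import sqlite3


def save_to_sqlite(conn, result):
    """Insert one scraped result dict into the `results` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (url TEXT PRIMARY KEY, data TEXT)"
    )
    # Store the whole dict as JSON; a re-crawled URL overwrites its old row.
    conn.execute(
        "INSERT OR REPLACE INTO results (url, data) VALUES (?, ?)",
        (result["url"], json.dumps(result, ensure_ascii=False)),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # pass a file path to persist data
    save_to_sqlite(conn, {"url": "https://example.com/a", "name": "Example"})
    print(conn.execute("SELECT COUNT(*) FROM results").fetchone()[0])
    conn.close()
```

Inside the handler, on_result could call a function like this in place of saveToFile, using a connection opened on a file path. Keying the table on the URL keeps one row per page when the same page is crawled again.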
