A Roundup of Node.js Crawler Projects on GitHub

Original post

2020-06-10 21:11

DistributedCrawler

Blog post

nodejs_crawler

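A master-slave distributed crawler written in Node.js.

Uses Redis as the task queue service
The master process fetches the tasks
The slave processes fetch the data and download it
Data is fetched through a proxy interface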
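
The division of labor described above can be sketched as follows. This is only a minimal illustration, not code from nodejs_crawler: an in-memory array stands in for the Redis list, and the names TaskQueue, master, worker and fakeDownload are hypothetical. In a real setup the push and pop would be Redis list commands (for example LPUSH and RPOP) issued by separate processes.

```javascript
// In-memory stand-in for a Redis list used as a task queue (assumption:
// a real deployment would use Redis commands such as LPUSH / RPOP).
class TaskQueue {
  constructor() {
    this.items = [];
  }
  push(task) {
    this.items.unshift(task); // like LPUSH: add at the head
  }
  pop() {
    return this.items.pop(); // like RPOP: take from the tail, so order is FIFO
  }
  get length() {
    return this.items.length;
  }
}

// Master: turns seed URLs into tasks and puts them on the queue.
function master(queue, urls) {
  for (const url of urls) {
    queue.push({ url });
  }
}

// Hypothetical download step; a real worker would issue an HTTP request,
// possibly through a proxy interface.
function fakeDownload(url) {
  return `<html>content of ${url}</html>`;
}

// Worker: takes tasks off the queue until it is empty and downloads each one.
function worker(name, queue, download) {
  const results = [];
  let task;
  while ((task = queue.pop()) !== undefined) {
    results.push({ worker: name, url: task.url, body: download(task.url) });
  }
  return results;
}

const queue = new TaskQueue();
master(queue, ['https://example.com/a', 'https://example.com/b']);
const results = worker('worker-1', queue, fakeDownload);
console.log(results.map((r) => r.url));
```

Because the master and the workers only share the queue, more workers can be added without changing the master.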

headless-chrome-crawler

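Features

Distributed crawling
Configurable concurrency, delay and retry
Supports both depth-first and breadth-first search algorithms
Pluggable cache storage, such as Redis
Supports exporting results as CSV and JSON Lines
Pauses at the maximum number of requests and can resume at any time
Automatically injects jQuery for scraping
Saves screenshots as evidence of the crawl
Emulates devices and user agents
Priority queue for better crawling efficiency
Obeys robots.txt
Follows sitemap.xml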
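
To illustrate the difference between the two traversal orders in the feature list, here is a small sketch over an in-memory link graph. It does not use headless-chrome-crawler's API; links and crawlOrder are hypothetical names, and a real crawler would discover the links by loading each page.

```javascript
// Hypothetical site: each page maps to the pages it links to.
const links = {
  '/': ['/a', '/b'],
  '/a': ['/a1', '/a2'],
  '/b': ['/b1'],
  '/a1': [],
  '/a2': [],
  '/b1': [],
};

// Returns the order in which pages are visited.
// Strategy 'bfs' takes the oldest pending URL (queue behavior),
// strategy 'dfs' takes the newest pending URL (stack behavior).
function crawlOrder(start, strategy) {
  const pending = [start];
  const seen = new Set([start]);
  const order = [];
  while (pending.length > 0) {
    const url = strategy === 'bfs' ? pending.shift() : pending.pop();
    order.push(url);
    const next = links[url] || [];
    // For DFS, push in reverse so the first link on the page is explored first.
    const ordered = strategy === 'bfs' ? next : [...next].reverse();
    for (const link of ordered) {
      if (!seen.has(link)) {
        seen.add(link);
        pending.push(link);
      }
    }
  }
  return order;
}

console.log(crawlOrder('/', 'bfs')); // [ '/', '/a', '/b', '/a1', '/a2', '/b1' ]
console.log(crawlOrder('/', 'dfs')); // [ '/', '/a', '/a1', '/a2', '/b', '/b1' ]
```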

x-ray

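An HTML content scraping tool for Node.js.

Features:

Flexible schema: supports strings, arrays, arrays of objects, and nested object structures.
Composable: the API is entirely composable, giving you flexibility in how you scrape each page.
Pagination support: paginates through websites, scraping each page.
Crawler support: start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages.
Responsible: supports concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly.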
Pluggable drivers: swap in different drivers depending on your needs.
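
The pagination support can be pictured with the following sketch. It is not x-ray's API: the pages object simulates a paginated site in memory, and paginate and its limit parameter are hypothetical names chosen for illustration.

```javascript
// Hypothetical paginated site: each page has items and a link to the next page.
const pages = {
  '/list?page=1': { items: ['post 1', 'post 2'], next: '/list?page=2' },
  '/list?page=2': { items: ['post 3', 'post 4'], next: '/list?page=3' },
  '/list?page=3': { items: ['post 5'], next: null },
};

// Follows "next" links from the start page, collecting items,
// and stops after `limit` pages or when there is no next page.
function paginate(start, limit = Infinity) {
  const collected = [];
  let url = start;
  let visited = 0;
  while (url && visited < limit) {
    const page = pages[url];
    if (!page) break; // unknown page: stop instead of throwing
    collected.push(...page.items);
    url = page.next;
    visited += 1;
  }
  return collected;
}

console.log(paginate('/list?page=1', 2)); // [ 'post 1', 'post 2', 'post 3', 'post 4' ]
console.log(paginate('/list?page=1').length); // 5
```

Capping the number of pages is one simple way to keep a paginated scrape bounded.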

node-crawler

1) node-crawler is built on the bottleneck task scheduler: each URL it receives is added to the queue as a separate task and then executed.
2) It parses HTML with cheerio, jsdom or whacko.
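
The idea of turning each URL into a queued task can be sketched as below. This is a hand-written stand-in, not bottleneck's or node-crawler's actual code: LimitedQueue, maxConcurrent and fakeFetch are hypothetical, and the fetch step is simulated with a timer.

```javascript
// Minimal task queue that runs at most `maxConcurrent` tasks at a time.
class LimitedQueue {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.running = 0;
    this.peak = 0; // highest number of tasks observed running at once
    this.waiting = []; // tasks not yet started, in FIFO order
  }

  // Adds a task (a function returning a promise) and resolves with its result.
  add(task) {
    return new Promise((resolve, reject) => {
      this.waiting.push({ task, resolve, reject });
      this._next();
    });
  }

  // Starts waiting tasks while there is spare capacity.
  _next() {
    while (this.running < this.maxConcurrent && this.waiting.length > 0) {
      const { task, resolve, reject } = this.waiting.shift();
      this.running += 1;
      this.peak = Math.max(this.peak, this.running);
      task()
        .then(resolve, reject)
        .finally(() => {
          this.running -= 1;
          this._next();
        });
    }
  }
}

// Simulated download: resolves after a short delay.
function fakeFetch(url) {
  return new Promise((resolve) => setTimeout(() => resolve(`body of ${url}`), 10));
}

const taskQueue = new LimitedQueue(2);
const urls = ['/1', '/2', '/3', '/4', '/5'];
const done = Promise.all(urls.map((url) => taskQueue.add(() => fakeFetch(url))));

done.then((bodies) => {
  console.log(bodies.length, 'pages fetched, peak concurrency', taskQueue.peak);
});
```

All five URLs are queued immediately, but only two simulated requests are ever in flight at the same time.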

Blog post

Floodesh

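floodesh is the distributed version of node-crawler: the queue that crawler maintains is moved into a distributed database (MongoDB), and a host-side index and client-side workers are added, responsible for task scheduling and for crawling respectively.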

supercrawler

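Crawls web pages automatically and maintains a queue (FIFO, db, redisDb). Custom handlers can be defined to parse the content. It obeys robots.txt as well as rate and concurrency limits.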
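
Obeying robots.txt comes down to checking each URL path against the Disallow rules before it is queued. The sketch below is a deliberately simplified, hand-written check that only handles the "User-agent: *" group and plain prefix rules; it is not supercrawler's implementation, and parseDisallow and isAllowed are hypothetical names.

```javascript
// Collects the Disallow prefixes that apply to all user agents ("*").
// Simplification: ignores Allow rules, wildcards and agent-specific groups.
function parseDisallow(robotsTxt) {
  const disallowed = [];
  let inStarGroup = false;
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    if (line === '') continue;
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      inStarGroup = value === '*';
    } else if (field === 'disallow' && inStarGroup && value !== '') {
      disallowed.push(value);
    }
  }
  return disallowed;
}

// A path is allowed unless it starts with one of the disallowed prefixes.
function isAllowed(path, disallowed) {
  return !disallowed.some((prefix) => path.startsWith(prefix));
}

const robotsTxt = [
  'User-agent: *',
  'Disallow: /admin/',
  'Disallow: /tmp/   # scratch area',
  '',
  'User-agent: somebot',
  'Disallow: /',
].join('\n');

const rules = parseDisallow(robotsTxt);
console.log(rules); // [ '/admin/', '/tmp/' ]
console.log(isAllowed('/blog/post-1', rules)); // true
console.log(isAllowed('/admin/users', rules)); // false
```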

js-crawler

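Crawls by depth, which also determines when it stops.

Supports the HTTP and HTTPS protocols.

Uses an Executor to limit the rate at which tasks are processed (this part is not yet understood).

It uses three queues while crawling:

1) knownUrls: URLs that have already been visited, in a format like: {'https://www.baidu.com/': true, 'https://tieba.baidu.com/index.html?traceid=': false};

2) crawledUrls: URLs that have already been crawled;

3) _currentUrlsToCrawl: the queue of URLs waiting to be crawled.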
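
How the three structures could work together is sketched below. This is a reading of the description above rather than js-crawler's source: the in-memory site object replaces real HTTP requests, crawl is a hypothetical function, and the sketch assumes that the boolean in knownUrls records whether the page was fetched successfully.

```javascript
// Hypothetical site: URL -> links found on that page; a missing key means
// the simulated request fails.
const site = {
  'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a': ['https://example.com/', 'https://example.com/c'],
  'https://example.com/b': ['https://example.com/missing'],
  'https://example.com/c': [],
};

function crawl(startUrl, maxDepth) {
  const knownUrls = {}; // url -> true (fetched) or false (failed); an assumption
  const crawledUrls = []; // URLs fetched successfully, in order
  const _currentUrlsToCrawl = [{ url: startUrl, depth: 1 }]; // pending URLs

  while (_currentUrlsToCrawl.length > 0) {
    const { url, depth } = _currentUrlsToCrawl.shift();
    if (url in knownUrls) continue; // already handled
    const found = site[url];
    if (found === undefined) {
      knownUrls[url] = false; // simulated request failure
      continue;
    }
    knownUrls[url] = true;
    crawledUrls.push(url);
    if (depth < maxDepth) {
      // The depth limit decides when to stop following links.
      for (const link of found) {
        if (!(link in knownUrls)) {
          _currentUrlsToCrawl.push({ url: link, depth: depth + 1 });
        }
      }
    }
  }
  return { knownUrls, crawledUrls };
}

const result = crawl('https://example.com/', 2);
console.log(result.crawledUrls);
```

With a depth of 2 only the start page and the pages it links to directly are fetched; raising the depth to 3 also reaches the pages linked from those.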

simplecrawler

SQLite-simplecrawler-queue

boloto

easier http crawler by Node.js

roboto

bot-marvin

cnblogSpider

A Node.js crawler project for cnblogs (博客园).

Blog post

node-fetch

A light-weight module that brings window.fetch to Node.js

fetch

mercury-parser

The Mercury project (reference: 丽姐)

html-extractor

Extracts metadata (body, title, meta tags, h1) from an HTML string.
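
As a rough illustration of that kind of extraction, the sketch below pulls the title, meta tags and h1 headings out of an HTML string with regular expressions. It does not use html-extractor's API; extractMeta is a hypothetical function, and regex matching is only adequate for simple, well-formed markup (a real tool would use an HTML parser).

```javascript
// Naive metadata extraction from an HTML string using regular expressions.
function extractMeta(html) {
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);

  // Collect <meta name="..." content="..."> pairs
  // (only when the name attribute comes directly before content).
  const meta = {};
  const metaRe = /<meta\s+name=["']([^"']+)["']\s+content=["']([^"']*)["'][^>]*>/gi;
  for (const m of html.matchAll(metaRe)) {
    meta[m[1]] = m[2];
  }

  // Collect the text of every <h1>, with inner tags stripped.
  const h1 = [];
  const h1Re = /<h1[^>]*>([\s\S]*?)<\/h1>/gi;
  for (const m of html.matchAll(h1Re)) {
    h1.push(m[1].replace(/<[^>]+>/g, '').trim());
  }

  return { title: titleMatch ? titleMatch[1].trim() : null, meta, h1 };
}

const sampleHtml = `
<html>
  <head>
    <title> Node.js crawlers </title>
    <meta name="description" content="A list of crawler projects">
    <meta name="keywords" content="nodejs,crawler">
  </head>
  <body><h1>Crawler <em>roundup</em></h1></body>
</html>`;

console.log(extractMeta(sampleHtml));
```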

parse5

HTML parsing/serialization toolset for Node.js.

Crawlab Team
