A Simple NodeJS Crawler: Downloading Resources

Original

2018-12-28 16:12

1. Requirements

Capture the resources requested by the current web page, including: js, css, …

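2. Environment and tools

NodeJS, puppeteer, Gulp

3. Setting up the environment

3-1. Installing nodejs

To support async and await we need a reasonably recent version of node (async functions work without a flag from Node 7.6 onwards). For the installation steps, see the official site: https://nodejs.org/zh-cn/download/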
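
As a quick sanity check, the sketch below tests whether the running node is new enough for async/await. It assumes 7.6.0 as the minimum version, and the helper name supportsAsyncAwait is made up for this post.

```javascript
// Hypothetical helper: returns true when a Node version string such as
// "10.15.0" is at least 7.6.0 (the assumed minimum for native async/await).
function supportsAsyncAwait(version) {
    const [major, minor] = version.replace(/^v/, '').split('.').map(Number);
    return major > 7 || (major === 7 && minor >= 6);
}

// Print the running version and whether it passes the check
console.log(process.versions.node, supportsAsyncAwait(process.versions.node));
```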

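3-2. Installing puppeteer

Mac users are advised to install with yarn, as shown below. Chromium is downloaded by default during installation; if that step fails, you can skip it and install Chromium manually.

  yarn add puppeteer
  # or
  npm install puppeteer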
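
If Chromium was installed manually, puppeteer has to be told where to find it. The sketch below builds the options object for puppeteer.launch() and fills in executablePath, which is a real launch option. The environment variable name CHROMIUM_PATH and the helper buildLaunchOptions are assumptions made up for this example, not something puppeteer defines.

```javascript
// Build options for puppeteer.launch(). CHROMIUM_PATH is a made-up variable
// name for this sketch; executablePath is the launch option it feeds.
function buildLaunchOptions(env) {
    const options = {headless: true};
    if (env.CHROMIUM_PATH) {
        // Point puppeteer at the manually installed browser
        options.executablePath = env.CHROMIUM_PATH;
    }
    return options;
}

console.log(buildLaunchOptions(process.env));
// Later, in the crawler: await puppeteer.launch(buildLaunchOptions(process.env));
```

Puppeteer releases from around the time of this post also read the PUPPETEER_SKIP_CHROMIUM_DOWNLOAD environment variable to skip the bundled download; check the documentation for the version you install, since this has changed over time.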

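3-3. Installing gulp

gulp needs to be installed globally to get the gulp command. The --save-dev and -g flags do not belong in the same command, and the gulpfile also requires gulp from the project itself, so install it locally as well:

npm install -g gulp
npm install --save-dev gulp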

4. Crawler code

4-1. Capturing the requested resources

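const puppeteer = require('puppeteer');

// Collect every resource the page at `url` requests
async function collectRequests(url) {
    const links = [];
    const allRequests = new Map();
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();

        // Register the listener before navigating so that no request is missed
        page.on('request', req => {
            links.push(req.url()); // public url() method instead of the private _url field
            allRequests.set(req.url(), req);
        });

        // Open the page and wait until the network is idle
        await page.goto(url, {waitUntil: 'networkidle0'});
    } finally {
        await browser.close();
    }
    return {links, allRequests};
}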
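
The listener above records every request. To keep only js and css, one option is to filter on the request's resourceType(), which puppeteer reports as 'script' or 'stylesheet' for those files. The sketch below is one possible way to do that, with duplicates removed by a Set. The helper name collectByType is made up for this post, and the fake page and fake requests exist only so the example runs without a browser.

```javascript
const EventEmitter = require('events');

// Attach a 'request' listener that records each wanted resource URL once.
// `page` only needs an on() method, so a puppeteer Page works here.
function collectByType(page, wantedTypes) {
    const found = new Set();
    page.on('request', req => {
        if (wantedTypes.includes(req.resourceType())) {
            found.add(req.url());
        }
    });
    return found;
}

// Stand-in for a real page, used only to show the behaviour without a browser
const fakePage = new EventEmitter();
const found = collectByType(fakePage, ['script', 'stylesheet']);
const fakeRequest = (url, type) => ({url: () => url, resourceType: () => type});

fakePage.emit('request', fakeRequest('https://example.com/app.js', 'script'));
fakePage.emit('request', fakeRequest('https://example.com/app.js', 'script'));
fakePage.emit('request', fakeRequest('https://example.com/logo.png', 'image'));

console.log([...found]); // [ 'https://example.com/app.js' ]
```

With a real page, call collectByType(page, ['script', 'stylesheet']) before page.goto() and read the Set once navigation has finished.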

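4-2. Downloading the resources

const gulp = require('gulp');
const download = require('gulp-downloader');

// `url` is a resource URL collected in step 4-1
gulp.task('download', function () {
    // Return the stream so gulp knows when the task has finished;
    // the original wrapped it in a Promise that was never resolved
    return download(url).pipe(gulp.dest('./dist')); // save the downloaded resources into the target directory
});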
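
The task above writes everything into ./dist. If the original directory layout should be preserved, each URL has to be mapped to a local path first. The sketch below shows one way to do that with Node's built-in URL and path modules. The helper name localPathFor, the choice to include the host name as a directory, and the index.html fallback are all assumptions for illustration, not part of the original code.

```javascript
const path = require('path');

// Hypothetical helper: map a resource URL to a file path under destDir,
// keeping the host and the URL path and dropping the query string.
function localPathFor(resourceUrl, destDir) {
    const {hostname, pathname} = new URL(resourceUrl);
    // A URL ending in "/" has no file name, so fall back to index.html
    const filePath = pathname.endsWith('/') ? pathname + 'index.html' : pathname;
    return path.posix.join(destDir, hostname, filePath);
}

console.log(localPathFor('https://example.com/static/app.js?v=3', './dist'));
// dist/example.com/static/app.js
```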
