Spring Boot Tasks: Crawling the Weibo Hot Search into a Database on a Schedule with JSoup and Scheduled Tasks

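0. Preface

As of the time of writing:

The Weibo hot search URL is https://s.weibo.com/top/summary. If it has changed, search for the Weibo hot search page yourself.
In the HTML source, the hot search data sits in a table tag, and the first, pinned entry has no heat index. Sometimes there are also recommended entries (probably ads).
Refreshing the page can return different results, and this happens fairly often.

As shown in the figure below: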

1. Import the JSoup dependency

        <!--   jsoup HTML parsing library     -->
        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.13.1</version>
        </dependency>

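2. Test crawling the Weibo hot search

Weibo hot search URL: https://s.weibo.com/top/summary

Press Ctrl+U to view the HTML source: the hot search data is rendered as a table. How to read the contents of a table tag with Jsoup is easy to look up.

I followed this blog post: [Getting td and tr from a table with Java jsoup]( http://www.yq1012.com/myweb/2162.html )

It shows that crawling the contents of a table tag is quite simple. Try it in a test class; the code, with comments, is as follows: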

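    // Imports the test class needs (JUnit 5 assumed, since the test method is not public):
    // import org.jsoup.Jsoup;
    // import org.jsoup.nodes.Document;
    // import org.jsoup.select.Elements;
    // import org.junit.jupiter.api.Test;
    // import java.io.IOException;

    @Test
    void TestCrawlingHotSearch() {
        try {
            String urlStr = "https://s.weibo.com/top/summary";
            String baseurl = "https://s.weibo.com"; // prefix that turns the relative href into a full, visitable URL
            final Document doc = Jsoup.connect(urlStr).get(); // fetch and parse the HTML
            Elements trs = doc.select("tbody").select("tr"); // every tr under tbody
            for (org.jsoup.nodes.Element tr : trs) {
                Elements tds = tr.select("td");
                String rank = tds.get(0).text(); // rank
                String num = tds.get(1).select("span").text(); // heat index
                String title = tds.get(1).select("a").text(); // hot search title
                String url = tds.get(1).select("a").attr("href"); // hot search URL (relative address)
                // print as: rank + hot search title + heat index + full URL
                System.out.println(rank + " " + title + " " + num + " " + baseurl + url);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }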
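The test above builds the full URL by concatenating a base address and the relative href. As a small alternative sketch, the same join can be done with the standard java.net.URI class, which also leaves an already absolute href untouched. The class name HotSearchUrls and the method absolute are hypothetical names made up for this illustration; they do not appear in the original code.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the original post.
final class HotSearchUrls {
    // Base address of the hot search site, the same value as baseurl in the test.
    private static final URI BASE = URI.create("https://s.weibo.com");

    // Resolves a relative href such as "/weibo?q=...&Refer=top" against BASE.
    // An href that is already absolute is returned unchanged.
    // Throws IllegalArgumentException if href is not a syntactically valid URI.
    static String absolute(String href) {
        return BASE.resolve(href).toString();
    }

    public static void main(String[] args) {
        System.out.println(absolute("/weibo?q=%E4%BD%99%E6%96%87%E4%B9%90&Refer=top"));
    }
}
```

Whether this is worth it over plain concatenation depends on how much the hrefs vary; for hrefs that always start with "/", both approaches give the same string.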

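The test result is as follows, for the Weibo hot search crawled at 21:44 on 2020.03.02. The list itself holds only 50 entries; the row at the very top is a pinned entry and has no heat index. The scheduled task later drops that row simply by starting the for loop at index 1.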

28個省份恢復省際省內道路客運  https://s.weibo.com/weibo?q=%2328%E4%B8%AA%E7%9C%81%E4%BB%BD%E6%81%A2%E5%A4%8D%E7%9C%81%E9%99%85%E7%9C%81%E5%86%85%E9%81%93%E8%B7%AF%E5%AE%A2%E8%BF%90%23&Refer=new_time
1 孫楊公佈完整血樣瓶 6140146 https://s.weibo.com/weibo?q=%23%E5%AD%99%E6%9D%A8%E5%85%AC%E5%B8%83%E5%AE%8C%E6%95%B4%E8%A1%80%E6%A0%B7%E7%93%B6%23&Refer=top
2 高鑫去口罩廠做義工 3514533 https://s.weibo.com/weibo?q=%23%E9%AB%98%E9%91%AB%E5%8E%BB%E5%8F%A3%E7%BD%A9%E5%8E%82%E5%81%9A%E4%B9%89%E5%B7%A5%23&Refer=top
3 韓國新冠肺炎定點醫院16名護士辭職 2636260 https://s.weibo.com/weibo?q=%23%E9%9F%A9%E5%9B%BD%E6%96%B0%E5%86%A0%E8%82%BA%E7%82%8E%E5%AE%9A%E7%82%B9%E5%8C%BB%E9%99%A216%E5%90%8D%E6%8A%A4%E5%A3%AB%E8%BE%9E%E8%81%8C%23&Refer=top
4 余文樂 2355872 https://s.weibo.com/weibo?q=%E4%BD%99%E6%96%87%E4%B9%90&Refer=top
5 偶像失聲 2011436 https://s.weibo.com/weibo?q=%E5%81%B6%E5%83%8F%E5%A4%B1%E5%A3%B0&Refer=top
6 馬雲回贈日本100萬隻口罩 1627005 https://s.weibo.com/weibo?q=%23%E9%A9%AC%E4%BA%91%E5%9B%9E%E8%B5%A0%E6%97%A5%E6%9C%AC100%E4%B8%87%E5%8F%AA%E5%8F%A3%E7%BD%A9%23&Refer=top
....
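In the output above, the pinned entry is the row whose rank and heat index are both empty. Skipping index 0 works as long as the pinned row is always first; a possible alternative is to look at the cell contents instead. The following is a minimal sketch under that assumption, using only the standard library. HotSearchNumbers, parse and isRanked are hypothetical names, and whether recommended (ad) rows also carry non-numeric cells has not been checked here.

```java
import java.util.OptionalLong;

// Hypothetical helper for illustration; not part of the original post.
final class HotSearchNumbers {
    // Parses a cell text such as "6140146" into a number.
    // Null, blank or non-numeric text (for example the pinned entry's empty
    // rank and heat index) yields an empty OptionalLong.
    static OptionalLong parse(String text) {
        if (text == null) {
            return OptionalLong.empty();
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(trimmed));
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    // A row counts as a regular ranked entry only when both cells are numeric.
    static boolean isRanked(String rank, String heat) {
        return parse(rank).isPresent() && parse(heat).isPresent();
    }

    public static void main(String[] args) {
        System.out.println(isRanked("1", "6140146")); // a regular row
        System.out.println(isRanked("", ""));         // the pinned row
    }
}
```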

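That is really all there is to crawling the hot search. What follows only adds Spring Boot scheduled tasks and MybatisPlus so that the hot search is crawled into a database on a schedule.
For a simple integration of scheduled tasks and MybatisPlus, see:
Spring Boot Tasks: Scheduled Tasks
Spring Boot Data Access: Integrating MybatisPlus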

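3. Crawling into the database on a schedule with the scheduled-task annotation

3.1 Import dependencies and configure MySQL

This part uses the Lombok plugin, MybatisPlus and MySQL. Import the relevant dependencies: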

        <!--        MySQL driver-->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <scope>runtime</scope>
        </dependency>

        <!--        Lombok plugin-->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>

        <!--        mybatis-plus starter-->
        <dependency>
            <groupId>com.baomidou</groupId>
            <artifactId>mybatis-plus-boot-starter</artifactId>
            <version>3.3.1.tmp</version>
        </dependency>

Configure the MySQL data source in application.properties, or in a newly created application.yml:

The default application.properties:

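# MySQL data source configuration
# (Connector/J 8.x names the driver com.mysql.cj.jdbc.Driver; com.mysql.jdbc.Driver is the older, deprecated name)
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://*.*.*.*:3306/your_database_name
spring.datasource.username=your_username
spring.datasource.password=your_password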

YAML format:

# MySQL data source configuration
spring:
  datasource:
    username: your_username
    password: your_password
    url: jdbc:mysql://*.*.*.*:3306/your_database_name
    driver-class-name: com.mysql.jdbc.Driver

3.2 The hot search entity class and its database table

To keep things simple, every field here is a String (I was being lazy).

Entity class:

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
// id, time, rank, hot search title, heat index, full URL
@Data
@AllArgsConstructor
@NoArgsConstructor
public class HotSearch {
    String id;//generated as a UUID
    String date;//time format: yyyy-MM-dd HH:mm:ss
    String rank;
    String title;
    String number;
    String url;
}
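The comments on id and date say how the two values are produced: a random UUID and a timestamp formatted as yyyy-MM-dd HH:mm:ss. A minimal sketch of generating both with the standard library is shown here. It uses java.time's DateTimeFormatter, which is immutable and thread-safe, rather than SimpleDateFormat; that is a substitution made for this illustration. HotSearchFields, newId and timestamp are hypothetical names.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

// Hypothetical helper for illustration; not part of the original post.
final class HotSearchFields {
    // Same pattern as the one documented on the entity's date field.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Returns a random UUID in its canonical 36-character string form.
    static String newId() {
        return UUID.randomUUID().toString();
    }

    // Formats the given moment in the yyyy-MM-dd HH:mm:ss layout.
    static String timestamp(LocalDateTime time) {
        return time.format(FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(newId());
        System.out.println(timestamp(LocalDateTime.now()));
    }
}
```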

Database table:

3.3 Write the Mapper for the entity class with MyBatisPlus

This gives it the basic CRUD operations.

import com.baomidou.mybatisplus.core.mapper.BaseMapper;
import com.piao.springboot_scheduled_jsoup.Entity.HotSearch;
import org.apache.ibatis.annotations.Mapper;

@Mapper
public interface HotSearchMapper extends BaseMapper<HotSearch> {
}

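3.4 Run the crawl on a schedule with the @Scheduled annotation

Building on the earlier test code, this version adds a timestamp and an ID, and inserts each row into the database through hotSearchMapper.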

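import com.piao.springboot_scheduled_jsoup.Entity.HotSearch;
// Mapper package assumed from the @MapperScan value on the main class
import com.piao.springboot_scheduled_jsoup.mapper.HotSearchMapper;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import javax.annotation.Resource; // jakarta.annotation.Resource on Spring Boot 3
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;

@Service
public class MyJSoupService {
    @Resource
    HotSearchMapper hotSearchMapper;

    //Crawl the Weibo hot search on a schedule. This cron expression fires every day at 22:50;
    //for 12:00 noon every day it would be "0 0 12 * * *".
    @Scheduled(cron ="0 50 22 * * *")
    public void CrawlingHotSearch() {
        try {
            String urlStr = "https://s.weibo.com/top/summary";
            String baseurl="https://s.weibo.com";//base address
            SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");//date format
            final Document doc = Jsoup.connect(urlStr).get();
            Elements trs = doc.select("tbody").select("tr");
            //start at 1 to skip the pinned entry in the first row
            for (int i = 1; i < trs.size(); i++) {
                Elements tds = trs.get(i).select("td");
                String rank = tds.get(0).text();//rank
                String num = tds.get(1).select("span").text();//heat index
                String title = tds.get(1).select("a").text();//title
                String url = tds.get(1).select("a").attr("href");//relative URL of the hot search page
                //time
                String date = df.format(new Date());// new Date() is the current system time; a timestamp would also work
                //id
                String hotSearchId= UUID.randomUUID()+"";
                //print as: id + time + rank + hot search title + heat index + full URL
                System.out.println("id:"+hotSearchId+" time"+date+" rank"+rank+" title"+title+" heat index"+num+" full URL"+baseurl+url);
                hotSearchMapper.insert(new HotSearch(hotSearchId,date,rank,title,num,baseurl+url));//insert a single hot search row
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}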
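A Spring cron expression has six fields: second, minute, hour, day of month, month and day of week. The expression "0 50 22 * * *" in the service therefore means second 0, minute 50, hour 22, on every day. To make that concrete, here is a plain java.time sketch that computes the next moment such a daily schedule would fire. It only models a fixed time of day and is not how Spring evaluates cron expressions; DailySchedule and nextRun are hypothetical names.

```java
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical helper for illustration; not part of the original post
// and not Spring's cron implementation.
final class DailySchedule {
    // Returns the first moment strictly after `now` whose time of day equals `at`.
    static LocalDateTime nextRun(LocalDateTime now, LocalTime at) {
        LocalDateTime candidate = now.toLocalDate().atTime(at);
        return candidate.isAfter(now) ? candidate : candidate.plusDays(1);
    }

    public static void main(String[] args) {
        // Equivalent of "0 50 22 * * *": every day at 22:50:00.
        System.out.println(nextRun(LocalDateTime.now(), LocalTime.of(22, 50)));
    }
}
```

Changing the daily run to noon would mean LocalTime.of(12, 0) in this sketch, which corresponds to the cron expression "0 0 12 * * *".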

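3.5 Enable scheduled tasks with the @EnableScheduling annotation

Don't forget to put the @EnableScheduling annotation on a class to turn scheduled tasks on; here it goes on the application's main class: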

@EnableScheduling
@SpringBootApplication
@MapperScan("com.piao.springboot_scheduled_jsoup.mapper")
public class SpringbootScheduledJsoupApplication {

    public static void main(String[] args) {
        SpringApplication.run(SpringbootScheduledJsoupApplication.class, args);
    }

}

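3.6 Run it

For testing I changed the scheduled time to 22:52 and ran it; the results are as follows:

The console prints as expected:

The database receives the 50 hot search rows as expected: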
