Using Regular Expressions to Extract Specific Data from a Website

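A recent project of mine had the following requirement: the client wanted us to display, in real time on a map, the number of visitors at certain designated scenic spots, but gave us no data API. The latest figures could, however, be read from a web page that is updated once an hour. So my manager assigned me to build a feature that scrapes the data from that page.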

Since the source is a web page, most of its content is irrelevant to us, so a regular expression is needed to filter out just the data we want.

I have to say, regular expressions are far more convenient than substring, and they perform quite well too. Here is the code:


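/**
 * Gets the date from the website.
 *
 * @Title: getDate
 * @Date : 2014-8-12 09:42:26 AM
 * @return the last matched date string, or an empty string if nothing matched
 */
private String getDate() {
    // Requires java.util.regex.Matcher and java.util.regex.Pattern
    // Fetch the raw page source
    String table = catchData();
    String date = "";
    // Use a regular expression to pick out the value that follows the tag
    Pattern places = Pattern.compile("(<p align=\"center\">)([^\\s]*)");
    Matcher matcher = places.matcher(table);
    while (matcher.find()) {
        System.out.println(matcher.group(2));
        // Overwritten on every iteration, so the last match wins
        date = matcher.group(2);
    }
    return date;
}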

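/**
 * Fetches the raw, unprocessed data from the website.
 *
 * @Title: catchData
 * @Date : 2014-8-12 09:34:30 AM
 * @return the page source, or an empty string if the request failed
 */
private String catchData() {
    // Requires java.util.HashMap and java.util.Map
    String table = "";
    try {
        Map<String, Object> map = new HashMap<String, Object>();
        // Do not delete: UrlUtil needs at least one parameter, otherwise it throws
        map.put("a", "1");
        table = AsyncRequestUtil.getJsonResult(map, "http://s.visitbeijing.com.cn/flow.php");
    } catch (Exception e) {
        e.printStackTrace();
    }
    return table;
}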

【AsyncRequestUtil.java】


package com.zhjy.zydc.util;

import java.util.Map;

/**
 * Asynchronous data request helper.
 * @author      : Cuichenglong
 * @group       : tgb
 * @Version     : 1.00
 * @Date        : 2014-5-28 09:54:20 AM
 */
public class AsyncRequestUtil {

    /**
     * Requests data from the given URL.
     * @Title: getJsonResult
     * @param map request parameters
     * @param strURL the URL to request
     * @return the response body
     */
    public static String getJsonResult(Map<String, Object> map, String strURL) throws Exception {
        // Cross-domain request: fetch the response body
        String result = UrlUtil.getDataFromURL(strURL, map);
        // Strip a leading literal "null" in front of a JSON object; it appears
        // when a response has been accumulated onto a null String
        if (result != null && result.startsWith("null{")) {
            result = result.substring("null".length());
        }
        return result;
    }
}

【UrlUtil.java】


package com.zhjy.zydc.util;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.util.Map;

/**
 * Fetches data from a URL across domains.
 * @author      : Cuichenglong
 * @group       : Zhong Hai Ji Yuan
 * @Version     : 1.00
 * @Date        : 2014-5-27 04:14:26 PM
 */
public final class UrlUtil {

    /**
     * Sends the parameters to the URL as a request body and returns the response.
     * @Title: getDataFromURL
     * @param strURL the URL to request
     * @param param request parameters (must contain at least one entry)
     * @return the response body as a string
     * @throws Exception if the request fails
     */
    public static String getDataFromURL(String strURL, Map<String, Object> param) throws Exception {
        URL url = new URL(strURL);
        URLConnection conn = url.openConnection();
        conn.setDoOutput(true); // allow writing a request body
        conn.setConnectTimeout(5000); // connect timeout: 5 seconds (value is in milliseconds)
        conn.setReadTimeout(5000); // read timeout: 5 seconds (value is in milliseconds)

        // Build the body as key=value pairs joined by '&'.
        // size() << 4 multiplies by 16: an initial capacity of 16 chars per parameter.
        final StringBuilder sb = new StringBuilder(param.size() << 4);
        for (final Map.Entry<String, Object> entry : param.entrySet()) {
            Object value = entry.getValue();
            sb.append(entry.getKey()); // keys must not contain special characters
            sb.append('=');
            // String values are URL-encoded once (an earlier double decode/encode was disabled)
            if (value instanceof String) {
                value = URLEncoder.encode((String) value, "utf-8");
            }
            sb.append(value);
            sb.append('&');
        }
        // Remove the trailing '&' (throws if param is empty)
        sb.deleteCharAt(sb.length() - 1);

        try (OutputStreamWriter writer = new OutputStreamWriter(conn.getOutputStream(), "utf-8")) {
            writer.write(sb.toString());
            writer.flush();
        }

        // Start from an empty builder: appending lines to a null String
        // would prefix the result with the text "null"
        StringBuilder result = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "utf-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.append(line);
            }
        }
        return result.toString();
    }
}

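This is a very simple piece of code: it scrapes the date from http://s.visitbeijing.com.cn/flow.php.
There are really only two steps. The first is to request the URL in the background through java.net.URL and get the page's HTML source. The second is to extract the date from that HTML with a regular expression. A few words about the regular expression: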


Pattern places = Pattern.compile("(<p align=\"center\">)([^\\s]*)");  

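Inside the square brackets, ^ means negation and \s matches any whitespace character (space, tab, newline and so on), so [^\\s] matches any single character that is not whitespace. The first \ is the escape character required inside a Java string literal. The trailing * means zero or more repetitions, so the second group captures everything up to the next whitespace character.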
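To see how the character class behaves, here is a minimal sketch that runs the same pattern on a few hand-written fragments. The class name `NonWhitespaceDemo`, the helper `firstToken` and the sample strings are assumptions made for illustration; they are not the real page source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonWhitespaceDemo {

    // Same pattern as above: the literal tag, then zero or more
    // non-whitespace characters captured as group 2.
    private static final Pattern DATE_PATTERN =
            Pattern.compile("(<p align=\"center\">)([^\\s]*)");

    /** Returns group 2 of the first match, or null if the tag is not found. */
    public static String firstToken(String html) {
        Matcher m = DATE_PATTERN.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical fragments, written only to exercise the pattern.
        // Capture stops at the first whitespace character.
        System.out.println(firstToken("<p align=\"center\">2014-08-12 09:00</p>")); // 2014-08-12
        // '*' also allows zero characters, so this yields an empty string.
        System.out.println(firstToken("<p align=\"center\"> 2014-08-12</p>").isEmpty()); // true
        // With no whitespace before the closing tag, the tag is captured too.
        System.out.println(firstToken("<p align=\"center\">2014-08-12</p>")); // 2014-08-12</p>
        // No tag at all: find() fails.
        System.out.println(firstToken("<div>nothing here</div>")); // null
    }
}
```

The third case is worth keeping in mind: if the target page puts no whitespace between the value and the closing tag, a class such as `[^\\s<]*` would stop at the `<` instead.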

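The values are retrieved through the matcher. Each call to matcher.find() advances to the next match, and each pair of parentheses in the pattern corresponds to one group. matcher.group(1) returns <p align="center">, while matcher.group(2) returns the captured value.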
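The group numbering and the find() loop can be illustrated with a small sketch. The class `GroupDemo`, the method `allValues` and the two-line sample input are hypothetical; only the pattern itself comes from the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {

    private static final Pattern P =
            Pattern.compile("(<p align=\"center\">)([^\\s]*)");

    /** Collects group(2) of every match, in document order. */
    public static List<String> allValues(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = P.matcher(html);
        while (m.find()) {          // each call moves on to the next match
            values.add(m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical input containing two matching tags.
        String html = "<p align=\"center\">2014-08-12 </p>\n"
                    + "<p align=\"center\">09:00 </p>";

        Matcher m = P.matcher(html);
        if (m.find()) {
            System.out.println(m.group(0)); // whole match: <p align="center">2014-08-12
            System.out.println(m.group(1)); // first parentheses: <p align="center">
            System.out.println(m.group(2)); // second parentheses: 2014-08-12
        }
        System.out.println(allValues(html)); // [2014-08-12, 09:00]
    }
}
```

Note that getDate() overwrites its result on every iteration, so when the page contains several matching tags it returns the last one; collecting the values into a list, as in this sketch, keeps all of them.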

With this understood, we can scrape data from many web pages with little effort. And once we have the data, what can stand in our way!
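The same approach extends from the date to the visitor counts mentioned at the start. The sketch below is only an illustration under a stated assumption: it supposes that each scenic spot appears as a table row of the form `<tr><td>NAME</td><td>COUNT</td></tr>`. That layout, the class `VisitorCountParser` and the sample names are all hypothetical, so the pattern would have to be adapted to the real page source.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VisitorCountParser {

    // Assumed row layout (hypothetical): <tr><td>NAME</td><td>COUNT</td></tr>
    // Group 1 is the name, group 2 is the count as a run of digits.
    private static final Pattern ROW = Pattern.compile(
            "<tr>\\s*<td>([^<]+)</td>\\s*<td>(\\d+)</td>\\s*</tr>");

    /** Returns name -> visitor count, in the order the rows appear. */
    public static Map<String, Integer> parse(String html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            counts.put(m.group(1).trim(), Integer.parseInt(m.group(2)));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the real site.
        String html = "<table>"
                + "<tr><td>Site A</td><td>35000</td></tr>"
                + "<tr> <td>Site B</td> <td>12000</td> </tr>"
                + "</table>";
        System.out.println(parse(html)); // {Site A=35000, Site B=12000}
    }
}
```

A LinkedHashMap is used so the results keep the page order, which makes it easier to compare the output with the source by eye.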
