Building a rental recommendation website with Spring Boot (work in progress)

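Introduction

After running into quite a few pitfalls while renting an apartment after graduation, I wanted to build a site that recommends rental listings to students who have just left school. So far I have a working prototype.
GitHub: https://github.com/hanjg/house
- master branch: Spring Boot version
- ssm branch: SSM version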

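Main features

Current features:
- Continuously crawl rental data from Lianjia, including both listing and community information.
- Display all rental listings.
- Push and display the latest information on followed listings, together with the crawler status.
Planned features:
- Crawl several websites and aggregate the results.
- Manage followed communities, configurable by name, location and similar attributes.
- Recommend listings intelligently by combining price and other factors; social factors such as the source of the listing also need to be taken into account.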

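Tech stack

Database: MySQL
Backend:
- Spring Boot 2
- MyBatis
- WebMagic: crawler framework that scrapes the site's data.
- WebSocket: pushes messages to the browser.
Frontend:
- EasyUI (planned to be replaced by the more popular Bootstrap)
- JSP (views carried over from the SSM version; planned to be replaced by the more efficient Thymeleaf)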

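Main flow

Crawling data with WebMagic

In WebMagic:
- Downloader downloads pages.
- Scheduler schedules requests. The project uses a scheduler that does not deduplicate URLs, because the same pages must be crawled repeatedly.
- PageProcessor parses the downloaded pages.
- Pipeline persists the extracted data.
- ProxyPool supplies proxies. It is currently not used because the proxies proved unstable.
See the WebMagic home page and the WebMagic usage summary.
WebMagic is configured in the WebmagicConfig configuration class. The crawling service is implemented by CrawlerServiceImpl.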

@Configuration
public class WebmagicConfig {

    @Autowired
    private LianjiaConst lianjiaConst;
    @Autowired
    private CrawlerConst crawlerConst;

    @Autowired
    private PageProcessor pageProcessor;
    @Autowired
    private Pipeline pipeline;
    @Autowired
    private HttpClientDownloader downloader;
    @Autowired
    private Scheduler scheduler;
    @Autowired
    private ProxyPool proxyPool;

    @Bean
    public Spider spider() {
        Spider spider = us.codecraft.webmagic.Spider.create(pageProcessor);
        spider.addPipeline(pipeline);
        downloader.setProxyProvider(proxyPool);
        spider.setDownloader(downloader);
        spider.setScheduler(scheduler);
        spider.thread(crawlerConst.getThreadNum());
        return spider;
    }

    @Bean
    public Pipeline pipeline() {
        return new LianjiaDbPipeLine();
    }

    @Bean
    public Scheduler scheduler() {
        return new DuplicateQueueScheduler();
    }

    @Bean
    public PageProcessor pageProcessor() {
        LianjiaPageProcessor pageProcessor = new LianjiaPageProcessor(crawlerConst.getSleepTimes(),
                crawlerConst.getRetryTimes());
        pageProcessor.setCityRentRoot(lianjiaConst.getRentCityRoot());
        pageProcessor.setCity(lianjiaConst.getCityName());
        return pageProcessor;
    }

    @Bean
    public ProxyPool proxyPool() {
        return new ProxyPool();
    }

    @Bean
    public HttpClientDownloader httpClientDownloader() {
        return new HttpClientDownloader();
    }
}

Updating record status

Listings and communities both carry a status field that marks the state of the record: expired, latest, or updating.

 status TINYINT NOT NULL DEFAULT 2,

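public enum RecordStatus {
    // labels: expired, latest, updating
    EXPIRED((byte) 0, "過期"), LATEST((byte) 1, "最新"), UPDATING((byte) 2, "正在更新");

    private final Byte status;
    private final String state;

    // constructor required by the constant declarations above
    RecordStatus(Byte status, String state) {
        this.status = status;
        this.state = state;
    }
}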

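The main thread manages the spider and crawls the data. Every record seen in a crawl, whether newly inserted or already present, has its status set to UPDATING.
The status-update thread runs after the crawl finishes: it turns UPDATING into LATEST and the previous LATEST into EXPIRED.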

  <update id="updateStatus">
    update community
    set status = status - 1
    where status != 0
  </update>
  <update id="updateStatus">
    update renting_house
    set status = status - 1
    where status != 0
  </update>
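
The decrement in the statements above works because the three states are numbered in the order in which a record ages. The following stand-alone Java sketch mirrors the same rule in memory. The class StatusDemo and the method afterCrawl are hypothetical names used only for illustration; they are not part of the project.

```java
public class StatusDemo {
    static final byte EXPIRED = 0;
    static final byte LATEST = 1;
    static final byte UPDATING = 2;

    /** Mirrors "set status = status - 1 where status != 0" for a single record. */
    static byte afterCrawl(byte status) {
        // EXPIRED records are left alone; every other state moves one step down
        return status != EXPIRED ? (byte) (status - 1) : status;
    }

    public static void main(String[] args) {
        System.out.println("UPDATING -> " + afterCrawl(UPDATING));
        System.out.println("LATEST   -> " + afterCrawl(LATEST));
        System.out.println("EXPIRED  -> " + afterCrawl(EXPIRED));
    }
}
```

A record that was not seen in the current crawl keeps its old LATEST value until the update runs, which is why a single decrement is enough to expire it.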

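The two threads are coordinated in SpiderThreadManager with CountDownLatch, which guarantees that the update thread runs only after crawling has finished. See the detailed guide to CountDownLatch.
- The update thread waits for crawling to finish.
- When crawling finishes, the update thread is woken and the main thread waits for the update to finish.
- When the update finishes, the main thread is woken.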

    public void start(final int repeatTimes, List<String> rootUrls) {
        spiderRunnnig = true;
        int count = repeatTimes;
        while (count-- > 0) {
            updateStatusThreadStart();
            crawlStart(rootUrls);
            try {
                Thread.sleep(10 * 1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        spiderRunnnig = false;
    }

    private void crawlStart(List<String> urlList) {
        try {
            spider.addUrl(urlList.toArray(new String[urlList.size()]));
            spider.start();
            while (true) {
                Thread.sleep(10 * 1000);
                if (spider.getStatus().equals(Status.Stopped)) {
                    break;
                }
            }
            crawlAction.countDown();
            updateStatusAction.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    private void updateStatusThreadStart() {
        threadPool.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    crawlAction.await();

                 ...

                    updateStatusAction.countDown();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        });
    }
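
The handshake described above can be reproduced in a small stand-alone sketch. It is not the project's SpiderThreadManager: the class LatchHandshakeDemo, the method runRound and the event strings are made up for illustration. The sketch also assumes that a fresh pair of latches is created for each round, since a CountDownLatch cannot be reset once it reaches zero.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class LatchHandshakeDemo {

    /** Runs one crawl/update round and returns the events in the order they happened. */
    static List<String> runRound() {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        // Latches are one-shot, so a new pair is created for every round
        CountDownLatch crawlDone = new CountDownLatch(1);
        CountDownLatch updateDone = new CountDownLatch(1);

        Thread updater = new Thread(() -> {
            try {
                crawlDone.await();            // wait until crawling has finished
                events.add("update status");  // stand-in for the status UPDATE statements
                updateDone.countDown();       // wake the main thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        updater.start();

        try {
            events.add("crawl");              // stand-in for running the spider
            crawlDone.countDown();            // wake the updater
            updateDone.await();               // main thread waits for the update to finish
            events.add("round finished");
            updater.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(runRound());
    }
}
```

Because each step only happens after the latch it waits on has been counted down, the three events are always recorded in the same order, regardless of how the threads are scheduled.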

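Pushing updates

A WebSocket keeps a long-lived connection between the server and the browser; when the user opens or refreshes the page, the latest updates are pushed to the browser.
MyWebSocketHandler overrides methods of AbstractWebSocketHandler and sends the latest information both when a connection is established and when a message is received.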

    public void afterConnectionEstablished(WebSocketSession session) throws Exception {
        LOGGER.info("websocket connection established......");
        sendLatestNews(session);
    }

    private void sendLatestNews(WebSocketSession session) throws IOException {
        List<RentingHouse> houseList = new ArrayList<>();
        for (String communityName : pusherConst.getPushedCommunities()) {
            houseList.addAll(rentingHouseService.getLatestFavourateHouseList(communityName, lastPushTime));
        }
        String crawlerMessage = getSpiderMessage();
        String houseMessage = getHouseMessage(houseList);
        session.sendMessage(new TextMessage(crawlerMessage + "\n\n" + houseMessage));
        // update the last push time
        lastPushTime = new Date();
        LOGGER.info("last push time: {}", lastPushTime);
    }
    @Override
    public void handleMessage(WebSocketSession session, WebSocketMessage<?> message) throws Exception {
        LOGGER.info("websocket handle text message: {}", message);
        sendLatestNews(session);
    }

WebSocket references: a detailed tutorial, and integrating WebSocket with Spring.

Problems encountered

No runnable methods

A unit test fails with java.lang.Exception: No runnable methods.
Add the @Test annotation to a method of the class under src/test/java, or declare the class abstract.

net::ERR_CONNECTION_REFUSED

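The connection is refused, which can have many causes. In my case the disk was full, so nginx could not write its cache and refused connections.
How to diagnose: check the nginx or Tomcat logs and find the entries for the failing request.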

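Slow crawling

A single IP can request the site at most about twice per second; anything faster gets the IP banned. The most general way around this is to use proxies.
ProxyServiceImpl scrapes proxy IPs from Xici and similar proxy lists and serializes them to local disk for the crawler to use. The scraped proxies are extremely unstable, however: most of them cannot be reached again even after they have passed validation. Since there are currently only 10,000 to 20,000 records in total and they are refreshed about every 2 hours on average, proxies are not used for now.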

    private void getProxyFromXici() {
        int currentPage = 1;
        int urlCount = 0;
        while (true) {
            String url = crawlerConst.getXiciRoot() + currentPage;
            LOGGER.info("get proxy from: {}", url);
            try {
                Document document = Jsoup.connect(url).timeout(3 * 1000).get();
                Elements trs = document.getElementsByTag("tr");
                if (trs == null || trs.size() <= 1) { // only the header row (or nothing) is left
                    break;
                }
                for (int i = 1; i < trs.size(); i++) {
                    try {
                        LOGGER.debug("get url {}", ++urlCount);
                        Elements tds = trs.get(i).getElementsByTag("td");
                        Proxy proxy = new Proxy(tds.get(1).text(), Integer.valueOf(tds.get(2).text()));
                        if (!proxyPool.contain(proxy) && canUse(proxy)) {
                            proxyPool.add(proxy);
                        }
                        try {
                            Thread.sleep(1000);
                        } catch (InterruptedException e) {
                            LOGGER.error(e.toString());
                        }
                    } catch (Exception e) {
                        LOGGER.error(e.toString());
                    }
                }
            } catch (Exception e) {
                LOGGER.error(e.toString());
                break; // stop paging when a page cannot be fetched, otherwise the loop never ends
            }
            currentPage++;
        }
    }

See the reference on verifying that a proxy is usable.
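
Without proxies, the crawler has to stay under the limit of roughly two requests per second per IP mentioned above. In the project this is handled by the sleep time configured on WebMagic's Site; the stand-alone sketch below only illustrates the idea of enforcing a minimum interval between requests. The class Throttle and its methods are hypothetical names, and the 500 ms interval is simply derived from the two-per-second figure.

```java
import java.util.concurrent.TimeUnit;

public class Throttle {
    private final long minIntervalNanos;
    private long nextAllowedNanos;

    Throttle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
        this.nextAllowedNanos = System.nanoTime(); // the first request may go out immediately
    }

    /** Blocks until the minimum interval since the previous request slot has passed. */
    synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        long wait = nextAllowedNanos - now;
        if (wait > 0) {
            TimeUnit.NANOSECONDS.sleep(wait);
        }
        // reserve the next slot one interval after the slot that was just used
        nextAllowedNanos = Math.max(now, nextAllowedNanos) + minIntervalNanos;
    }

    /** Issues the given number of placeholder requests and returns the elapsed milliseconds. */
    static long timeRequests(int requests, long minIntervalMillis) {
        Throttle throttle = new Throttle(minIntervalMillis);
        long start = System.nanoTime();
        try {
            for (int i = 0; i < requests; i++) {
                throttle.acquire(); // a real crawler would download a page here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // two requests per second per IP means at least 500 ms between requests
        System.out.println("3 requests took about " + timeRequests(3, 500) + " ms");
    }
}
```

Because acquire is synchronized, several crawler threads sharing one Throttle instance are limited together, which matches a per-IP limit rather than a per-thread one.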

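HttpClient timeouts

Two timeouts need to be set. connectTimeout limits how long establishing the connection may take; socketTimeout limits how long to wait for data, that is, the maximum gap between two packets.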


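    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.MalformedURLException;
    import java.net.URL;

    // Returns true if the server answers the request with HTTP 200.
    public static boolean isConnServerByHttp(String serverUrl) {
        boolean connFlag = false;
        HttpURLConnection conn = null;
        try {
            URL url = new URL(serverUrl);
            conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(3 * 1000); // limit on establishing the connection
            conn.setReadTimeout(3 * 1000);    // limit on waiting for response data
            if (conn.getResponseCode() == 200) { // the server responded successfully
                connFlag = true;
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (conn != null) { // openConnection may not have been reached
                conn.disconnect();
            }
        }
        return connFlag;
    }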

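Always close the connection after an HttpClient request; otherwise the next request may hang.
It is best to set both connectTimeout and socketTimeout so that the program cannot block indefinitely.
- Without connectTimeout, the program can block and appear frozen while the TCP connection is being established.
- Without socketTimeout, the program can block and appear frozen when the TCP connection is already established and the network drops right after the request has been sent.
Sometimes a connect does not wait for the full connectTimeout:
when the socket connects and the network layer already knows the destination is unreachable, an exception is thrown immediately instead of after the configured connectTimeout. See the reference.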
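
The second case above, an established connection on which no data ever arrives, can be reproduced locally without any network. The sketch below is only an illustration using the JDK's HttpURLConnection, where the read timeout plays the role of socketTimeout. The class TimeoutDemo, its methods and the chosen timeout values are assumptions, not code from the project.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

public class TimeoutDemo {

    /** Sends a GET with both timeouts set and reports "ok", "timeout" or "error". */
    static String probe(String serverUrl, int connectMillis, int readMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(serverUrl).toURL().openConnection();
            conn.setConnectTimeout(connectMillis); // limit on establishing the TCP connection
            conn.setReadTimeout(readMillis);       // limit on waiting for response data
            return conn.getResponseCode() == 200 ? "ok" : "error";
        } catch (SocketTimeoutException e) {
            return "timeout";
        } catch (IOException e) {
            return "error";
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    /** Probes a local socket that accepts the TCP connection but never sends a response. */
    static String probeSilentServer() {
        try (ServerSocket silent = new ServerSocket(0)) {
            String url = "http://127.0.0.1:" + silent.getLocalPort() + "/";
            return probe(url, 1000, 500);
        } catch (IOException e) {
            return "error";
        }
    }

    public static void main(String[] args) {
        System.out.println(probeSilentServer());
    }
}
```

Without the setReadTimeout call, the same probe would block for as long as the silent server keeps the connection open.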

TIMESTAMP column with CURRENT_TIMESTAMP

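A table can have only one TIMESTAMP column that uses CURRENT_TIMESTAMP as its default or auto-update value. This restriction applies to MySQL versions before 5.6.5. See the reference.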

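Underscore in the nginx upstream name is invalid

Do not use _ in the name of an upstream. A likely cause is that nginx passes the upstream name to the backend as the Host header by default, and an underscore is not a valid hostname character, so a backend such as Tomcat may reject the request.

    upstream local_tomcat {
        server localhost:8080;
    }

Change it to:

    upstream localTomcat {
        server localhost:8080;
    }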

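Jar conflict between Logback and another SLF4J binding

An exception is thrown when Tomcat starts. Spring Boot logs through Logback by default, but another component in the project pulls in a different SLF4J binding (slf4j-log4j12 here), so two logging implementations end up on the classpath and conflict.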

Exception in thread "main" java.lang.IllegalArgumentException: LoggerFactory is not a Logback LoggerContext but Logback is on the classpath. Either remove Logback or the competing implementation

Keep only one of the two jars:
Option 1: exclude slf4j-log4j12. Every dependency that pulls it in needs an exclusion like the following.

	<dependency>
	    <groupId>org.springframework.boot</groupId>
	    <artifactId>spring-boot-starter-log4j</artifactId>
	    <version>1.3.8.RELEASE</version>
	    <exclusions>
	        <exclusion>
	            <groupId>org.slf4j</groupId>
	            <artifactId>slf4j-log4j12</artifactId>
	        </exclusion>
	    </exclusions>
	</dependency>

Option 2: exclude Logback.

    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-web</artifactId>
      <exclusions>
        <!-- log4j conflicts with logback, so drop logback -->
        <exclusion>
          <groupId>org.springframework.boot</groupId>
          <artifactId>spring-boot-starter-logging</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

See the reference.

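"Cookie rejected" warning

HttpClient logs a "Cookie rejected" warning when a response sets a cookie whose domain does not match the origin of the request, as in the log below. The crawler does not need to send cookies, so the warning can be avoided by disabling cookie handling.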

  2018-12-12 18:18:25 [WARN]-[org.apache.http.client.protocol.ResponseProcessCookies] Cookie rejected [select_city="320100", version:0, domain:zufangzi.com, path:/, expiry:Thu Dec 13 18:18:25 CST 2018] Illegal 'domain' attribute "zufangzi.com". Domain of origin: "nj.lianjia.com"

With HttpClient, simply ignore cookies (see the reference).

RequestConfig globalConfig = RequestConfig.custom().setCookieSpec(CookieSpecs.IGNORE_COOKIES).build();  
CloseableHttpClient client = HttpClients.custom().setDefaultRequestConfig(globalConfig).build();  
HttpGet request = new HttpGet(url);  
CloseableHttpResponse response = client.execute(request);

With WebMagic, reading the source shows that the disableCookieManagement property of Site needs to be set.

    public LianjiaPageProcessor(int sleepTime, int retryTimes) {
        this.site = Site.me().setRetryTimes(retryTimes).setSleepTime(sleepTime).setDisableCookieManagement(true);
    }

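Differences between Spring Boot and SSM

JSP is not supported by default; to add it, see: adding JSP support to a Spring Boot project.
MyBatis integration:
- There is no need to configure sqlSessionFactory by hand; the auto-configured factory covers most cases. A hand-written factory may not pick up settings from the yml file, such as: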

mybatis:
  mapper-locations: classpath:mapper/*.xml

See also: migrating from Spring MVC to Spring Boot.

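Differences between Spring Boot 2 and earlier versions

Version requirements:
- Java 8 or later
- Tomcat upgraded to 8.5
- Flyway upgraded to 5
- Hibernate upgraded to 5.2
- Thymeleaf upgraded to 3
Configuration properties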
