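Data Collection - A Case Study in Collecting Public Information from 1688

1. Background:

A friend who organizes trade shows asked me to help him collect information on 1688 manufacturers in one product category, so that he could invite them to exhibit.

2. Design approach:

Language: Python 3; IDE: PyCharm.
Mimic a real user visiting 1688 with Selenium + Google Chrome + chromedriver.exe.

Note 1: For matching Google Chrome and chromedriver.exe versions, see: https://blog.csdn.net/lildkdkdkjf/article/details/106871954

Note 2: Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, Opera, and others.

1688 throttles clients that request pages too frequently; the scraper handles this by sleeping and then retrying the same page.
Save the results as an Excel spreadsheet.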
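
The sleep-and-retry idea can be sketched independently of Selenium. The snippet below is only an illustration under stated assumptions: `fetch_with_backoff`, its parameters, and the `max_attempts` cap are hypothetical names that do not appear in the original project, which retries without an upper bound.

```python
import random
import time


def fetch_with_backoff(fetch, is_throttled, min_seconds, max_seconds,
                       max_attempts=5, sleep=time.sleep):
    """Call fetch() until the result is no longer throttled.

    fetch        -- zero-argument callable returning the page content
    is_throttled -- callable that returns True if the content is a block page
    sleep        -- injectable so the wait can be skipped in tests
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if not is_throttled(result):
            return result
        if attempt < max_attempts:
            # Wait a random time inside the configured window, then retry
            sleep(random.randint(min_seconds, max_seconds))
    raise RuntimeError("still throttled after %d attempts" % max_attempts)
```

With Selenium, `fetch` could load the URL and return `browser.page_source`, and `is_throttled` could check for the text of the slider-verification page. Randomizing the wait avoids retrying at a fixed, easily detected interval.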

3. Implementation:

Key code

    import random
    import time

    from selenium.webdriver.common.by import By

    # Method of the scraper class; self.browser is a Selenium WebDriver.
    def get_url_list(self):
        beginPage = 1
        while beginPage < 100:
            try:
                httpDone = ('http://s.1688.com/company/company_search.htm?n=y&netType=1,11&encode=utf-8&keywords=%s&beginPage=%d') % (
                           self.keyword_encode, beginPage)

                print("page", beginPage, httpDone)
                self.browser.get(httpDone)
                # find_elements_by_xpath was removed in Selenium 4.3; use find_elements(By.XPATH, ...)
                nodes = self.browser.find_elements(By.XPATH, '//a[@class="list-item-title-text"]')
                if len(nodes) == 0:
                    print("no result nodes found on page", beginPage)
                    # "滑動一下馬上回來" is the prompt shown on the slider-verification (throttling) page
                    if "滑動一下馬上回來" in self.browser.page_source:
                        seconds = random.randint(self.min_seconds, self.max_seconds)
                        print("throttled; sleeping", seconds, "s before retrying page", beginPage)
                        time.sleep(seconds)
                        continue
                    else:
                        print("no more results; stopping at page", beginPage)
                        break
                else:
                    # Collect this page's company links, skipping duplicates
                    self.url_list = []
                    for node in nodes:
                        url = node.get_attribute('href')
                        if url not in self.url_list:
                            self.url_list.append(url)
                    print("found", len(nodes), "nodes,", len(self.url_list), "unique URLs")

                    # Visit each supplier page and save its details
                    for url in self.url_list:
                        self.save_gys_info(url)

                beginPage = beginPage + 1
            except Exception as e:
                # On any error, wait 30 s and retry the same page
                print("error", e)
                time.sleep(30)
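
The article does not show `save_gys_info` or how the spreadsheet is written, so the following is only a rough sketch of the saving step, not the author's implementation. It swaps in the standard-library `csv` module and writes a UTF-8 CSV file with a byte-order mark, which Excel opens directly, instead of producing a real .xlsx file. The column names `company` and `url` and both function names are assumptions made for illustration.

```python
import csv
import io

FIELDS = ["company", "url"]  # hypothetical columns, not taken from the article


def rows_to_csv_text(rows, fields=FIELDS):
    """Render a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def write_rows(path, rows, fields=FIELDS):
    """Write the rows to path as UTF-8 with a BOM so Excel detects the encoding."""
    # newline="" stops Python from translating the CSV line endings again
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        f.write(rows_to_csv_text(rows, fields))
```

If a native .xlsx file is required, a spreadsheet library would be needed in place of `csv`; the row-building logic would stay the same.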

Configuration file

{
    "chrome": "",
    "chromedriver": "chromedriver.exe",
    "keyword": "服裝",
    "min_seconds": 600,
    "max_seconds": 720
}
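
As a rough sketch of how such a file might be consumed: the loader below is not part of the original project, and `parse_config`, `throttle_delay`, and `encode_keyword` are hypothetical names. It assumes that `min_seconds` and `max_seconds` bound the random wait used when 1688 throttles the scraper (as the key code suggests) and that `keyword_encode` is the URL-encoded form of `keyword`.

```python
import json
import random
from urllib.parse import quote

DEFAULTS = {
    "chrome": "",
    "chromedriver": "chromedriver.exe",
    "keyword": "",
    "min_seconds": 600,
    "max_seconds": 720,
}


def parse_config(text):
    """Parse the JSON text, fill in defaults, and validate the wait window."""
    cfg = dict(DEFAULTS)
    cfg.update(json.loads(text))
    if not cfg["keyword"]:
        raise ValueError("keyword must not be empty")
    if not 0 <= cfg["min_seconds"] <= cfg["max_seconds"]:
        raise ValueError("need 0 <= min_seconds <= max_seconds")
    return cfg


def throttle_delay(cfg):
    """Pick a random wait, in seconds, inside the configured window."""
    return random.randint(cfg["min_seconds"], cfg["max_seconds"])


def encode_keyword(cfg):
    """Percent-encode the keyword as UTF-8 for use in the search URL (assumed)."""
    return quote(cfg["keyword"], encoding="utf-8")
```

Validating the window up front matters because `random.randint` raises `ValueError` if the lower bound exceeds the upper bound, which would otherwise only surface the first time the scraper is throttled.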

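Results

4. Summary

I finished this feature before the holiday and delivered it to my friend. He was very pleased, because it freed up his time and energy for more creative work.

That's all for this post; discussion is welcome! QQ / WeChat (same ID): 6550523

This article is for technical exchange only. Commercial use and reproduction are not permitted; violators will be held liable.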
