使用selenium爬取淘寶實戰.md

## Scraping Taobao with Selenium in Practice

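Decide what to scrape

Scrape the first-level categories on the left (women's wear, men's wear, underwear, and so on) and the second-level categories on the right (autumn new arrivals, dresses, and so on).

Import selenium

Before this step you need to install a WebDriver (ChromeDriver, since the code below drives Chrome).

Installation instructions are easy to find with a web search such as Baidu.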

 import json  # reads the category file in the second half
 import time
 from pymongo import MongoClient  # stores the scraped goods in the second half
 from selenium import webdriver
 from selenium.webdriver.common.action_chains import ActionChains
 from selenium.webdriver.common.by import By
 from selenium.webdriver.common.keys import Keys
 from selenium.webdriver.chrome.options import Options
 from selenium.webdriver.support.ui import WebDriverWait  # handles waiting
 # expected_conditions supplies the conditions that a wait triggers on
 from selenium.webdriver.support import expected_conditions as EC

Use Selenium's headless mode

 chrome_options = Options()
 chrome_options.add_argument('--enable-logging')
 chrome_options.add_argument('--disable-gpu')
 chrome_options.add_argument('--headless')

Create a simulated browser and apply the headless options

 browser = webdriver.Chrome(options=chrome_options)

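Set the URL to scrape

 url = "https://www.taobao.com/"

Send the request

 browser.get(url)

Locate the elements with XPath

 # find_elements_by_xpath was removed in Selenium 4.3; use find_elements(By.XPATH, ...)
 input_kw = browser.find_elements(By.XPATH, '//ul[@class="service-bd"]/li')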

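There is a catch here: the page only requests the right-hand links from the backend after the mouse has passed over the corresponding left-hand entry. So we have to loop over the left-hand entries and hover over each one, using the following call:

 # ActionChains(browser).move_to_element(i).perform()
 data = []  # one dict per first-level category
 for i in input_kw:
     print(i)
     li_a = i.find_elements(By.XPATH, 'a')
     for li in li_a:
         print(li.get_attribute('innerHTML'))
         data.append({"name": li.get_attribute('innerHTML')})
     # hover over the left-hand entry so the page loads its right-hand panel
     ActionChains(browser).move_to_element(i).perform()
     time.sleep(3)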
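
The fixed three-second sleep after each hover can be slower than necessary, and it can still be too short on a bad connection. A common alternative is to poll until the content has appeared. Selenium's `WebDriverWait` works this way; the standard-library sketch below only illustrates the idea, and the name `wait_until` and its defaults are assumptions, not part of the original script.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

In the hover loop, the predicate could for example check that the number of `service-panel` elements found on the page has grown since the previous hover. Whether that is a reliable signal on the live page is something to verify.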

Collect all the right-hand links

 items = browser.find_elements(By.XPATH, '//div[@class="service-panel"]')

 index = 0
 for item in items:
     good = item.find_elements(By.XPATH, 'p/a')
     h = item.find_element(By.XPATH, 'h5/a')
     # attach the panel's link to the first-level category at the same position
     data[index]["href"] = h.get_attribute('href')
     print("good", good)
     print("len(good)", len(good))
     item_list = []
     for g in good:
         print("g", g)
         print(g.get_attribute('innerHTML'))
         item_list.append(g.get_attribute('innerHTML'))
     data[index]["category"] = item_list
     index += 1
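
The loop above pairs left-hand names and right-hand panels purely by position. If the two lists differ in length, indexing `data[index]` can raise `IndexError`. The sketch below does the same pairing with `zip` on plain Python values, so it can be tried without a browser. The function name `merge_categories` and the `(href, subcategories)` tuple shape are assumptions made for illustration.

```python
def merge_categories(names, panels):
    """Pair each first-level name with the panel at the same position.

    names  -- list of first-level category names (left-hand side)
    panels -- list of (href, subcategories) tuples (right-hand side)
    """
    merged = []
    # zip stops at the shorter list, so a missing panel cannot raise IndexError
    for name, (href, subcategories) in zip(names, panels):
        merged.append({"name": name, "href": href, "category": list(subcategories)})
    return merged


if __name__ == "__main__":
    names = ["辦公", "DIY"]
    panels = [
        ("https://www.taobao.com/markets/bangong/pchome", ["電腦桌", "辦公椅"]),
        ("https://www.taobao.com/markets/dingzhi/home", ["定製T恤", "文化衫"]),
    ]
    for entry in merge_categories(names, panels):
        print(entry)
```

Note that `zip` silently drops unmatched entries, so it is worth comparing `len(names)` and `len(panels)` before trusting the result.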

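Close the Selenium browser

 browser.quit()  # closes the window and ends the WebDriver session

At this point we have all the product categories. The data should look like this: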

 [
   {
     "name": "辦公",
     "href": "https://www.taobao.com/markets/bangong/pchome",
     "category": [
       "WiFi放大器",
       "無線呼叫器",
       "格子間",
       "電腦桌",
       "辦公椅",
       "理線器",
       "計算器",
       "熒光告示貼",
       "翻譯筆",
       "毛筆",
       "馬克筆",
       "文件收納",
       "本冊",
       "書寫工具",
       "文具",
       "畫具畫材",
       "鋼筆",
       "中性筆",
       "財會用品",
       "碎紙機",
       "包裝設備"
     ]
   },
   {
     "name": "DIY",
     "href": "https://www.taobao.com/markets/dingzhi/home",
     "category": [
       "定製T恤",
       "文化衫",
       "工作服",
       "衛衣定製",
       "LOGO設計",
       "VI設計",
       "海報定製",
       "3D效果圖製作",
       "廣告扇",
       "水晶獎盃",
       "胸牌工牌",
       "獎盃",
       "徽章",
       "洗照片",
       "照片沖印",
       "相冊/照片書",
       "軟陶人偶",
       "手繪漫畫",
       "紙箱",
       "搬家紙箱",
       "膠帶",
       "標籤貼紙",
       "二維碼貼紙",
       "塑料袋",
       "自封袋",
       "快遞袋",
       "氣泡膜",
       "編織袋",
       "飛機盒",
       "泡沫箱",
       "氣柱袋",
       "紙手提袋",
       "打包繩帶",
       "氣泡信封",
       "纏繞膜"
     ]
   }]

This is only part of the data.
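
The second half of this walkthrough reads the categories from `taobao_type.json`, but the code so far never writes that file. A minimal sketch of saving and reloading the list with the standard `json` module follows. The helper names `save_categories` and `load_categories` are assumptions; only the file name comes from the later code.

```python
import json


def save_categories(data, path="taobao_type.json"):
    """Write the category list to disk as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Chinese names readable in the file
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_categories(path="taobao_type.json"):
    """Read the category list back from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Calling `save_categories(data)` right after the scraping loops would produce the file that the spider class opens later.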

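Scrape all the goods for each product category. Here we only take 5 pages per category and do not get greedy.

This time we take an object-oriented approach:

Create a taobaoSpider class

Put the fixed setup in the `__init__` method. We store the data in MongoDB, though it could just as well be saved locally.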

 class taobaoSpider:
     def __init__(self, file):
         self.conn = MongoClient('127.0.0.1', 27017)
         self.db = self.conn.orsp  # use the orsp database; it is created automatically if missing
         # headless-mode options
         chrome_options = Options()
         chrome_options.add_argument('--enable-logging')
         chrome_options.add_argument('--disable-gpu')
         chrome_options.add_argument('--headless')
         # override the user agent (PhantomJS variant, kept for reference)
         # dcap = dict(DesiredCapabilities.PHANTOMJS)
         # dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Linux; U; Android 4.1; en-us; GT-N7100 Build/JRO03C) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"
         # pass options=chrome_options to webdriver.Chrome() to actually run headless
         self.browser = webdriver.Chrome()
         # path of the category file
         self.file = file

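Taobao's goods pages are also loaded dynamically, so we need a method that scrolls the window.

     def roll_window(self):
         # scroll down 600 px at a time, ten times, pausing so new content can load
         for ti in range(10):
             self.browser.execute_script("window.scrollBy(0,600)")
             time.sleep(0.5)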

Open the category file scraped earlier and call the scraping method for each entry

     def open_file(self):
         with open(self.file, 'r', encoding='utf-8') as f:
             data = json.load(f)
         # go through every first-level category
         for i in data:
             # URL of the category page
             url = i["href"]
             self.belong_name = i["name"]
             for item in i["category"]:
                 self.item = item
                 # load the page and search for the second-level category
                 try:
                     # open the category page in the browser
                     self.browser.get(url)
                     time.sleep(1)
                     # find the search box
                     input_kw = self.browser.find_element(By.XPATH, '//div[@class="search-combobox-input-wrap"]/input')
                     # clear the search box every time
                     input_kw.clear()
                     # type the keyword to search for
                     input_kw.send_keys(item)
                     # press Enter
                     input_kw.send_keys(Keys.ENTER)
                 except Exception as ex:
                     print(ex)
                     continue
                 # scroll down steadily so the whole page gets rendered
                 self.roll_window()
                 for num in range(5):
                     try:
                         # the first page needs no click on "next page"
                         if num != 0:
                             # find the next-page button
                             next_step = self.browser.find_element(By.XPATH, '//ul/li[@class="item next"]')
                             next_step.click()
                             time.sleep(1)
                             # scroll steadily again
                             self.roll_window()
                         # collect all goods items; a few pages use a different XPath, so fall back with `or`
                         goods = (
                             self.browser.find_elements(By.XPATH, '//div[@class="item J_MouserOnverReq  item-sku J_ItemListSKUItem"]')
                             or self.browser.find_elements(By.XPATH, '//div[@class="item J_MouserOnverReq  "]')
                         )
                         # process every goods item
                         self.get_all_goods(goods)
                     except Exception as ex:
                         print(ex)
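
The nested loops above amount to visiting every (first-level name, URL, second-level keyword) combination in the category file. That traversal can be separated from the browser work, which also makes it easy to skip entries that never received an `href`. A sketch under those assumptions follows; the generator name `iter_search_tasks` is hypothetical.

```python
def iter_search_tasks(data):
    """Yield (belong_name, url, keyword) for every second-level category.

    Entries without an "href" are skipped, because there is no page
    to open for them.
    """
    for entry in data:
        url = entry.get("href")
        if not url:
            continue
        for keyword in entry.get("category", []):
            yield entry["name"], url, keyword


if __name__ == "__main__":
    sample = [
        {"name": "辦公", "href": "https://www.taobao.com/markets/bangong/pchome",
         "category": ["電腦桌", "辦公椅"]},
        {"name": "no-link"},  # would raise KeyError on i["href"] in the loop above
    ]
    for task in iter_search_tasks(sample):
        print(task)
```

With such a generator, the body of the scraping method reduces to a single `for belong_name, url, keyword in iter_search_tasks(data):` loop around the search and pagination steps.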

The main body of the scraper

     # process each goods item
     def get_all_goods(self, goods):
         for good in goods:
             taobao = {}
             good_item = good.find_element(By.XPATH, './/div[@class="pic"]/a')
             # goods id (read here but not stored below)
             goods_id = good_item.get_attribute('data-nid')
             # URL of the detail page
             taobao["detail_href"] = good_item.get_attribute('href')
             # price
             price = good_item.get_attribute('trace-price')
             img = good.find_element(By.XPATH, './/div[@class="pic"]/a/img')
             # the title comes from the image's alt attribute
             title = img.get_attribute('alt')
             img_href = img.get_attribute('src')  # image URL from the src attribute
             # read innerHTML, because element.text sometimes does not work well
             shop = good.find_element(By.XPATH, './/div[@class="shop"]/a/span[2]').get_attribute('innerHTML')
             address = good.find_element(By.XPATH, './/div[@class="location"]').get_attribute('innerHTML')
             sales_num = good.find_element(By.XPATH, './/div[@class="deal-cnt"]').get_attribute('innerHTML')
             taobao["belong_to"] = self.item
             taobao["sales_num"] = sales_num
             taobao["price"] = price
             taobao["img_href"] = img_href
             taobao["title"] = title
             taobao["shop"] = shop
             taobao["address"] = address
             taobao["belong_name"] = self.belong_name
             print(",", taobao)
             # insert the record into MongoDB (Collection.insert was removed in PyMongo 4)
             self.db.taobao_goods.insert_one(taobao)
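
Because the fields above are read through `innerHTML`, the stored strings may still contain nested tags, HTML entities, or stray whitespace. One possible cleanup step before inserting into MongoDB is sketched below with the standard library only. The helper name `inner_text` is an assumption, and the regular expression is a rough tag stripper rather than a full HTML parser.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")


def inner_text(inner_html):
    """Reduce an innerHTML string to plain, whitespace-normalized text."""
    # drop anything that looks like a tag
    text = _TAG_RE.sub("", inner_html or "")
    # turn entities such as &amp; back into characters
    text = html.unescape(text)
    # collapse runs of whitespace and trim both ends
    return " ".join(text.split())


if __name__ == "__main__":
    print(inner_text('<span class="icon"></span>  A &amp; B shop \n'))
```

It could be applied to `shop`, `address`, and `sales_num` before they are placed in the `taobao` dict.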

Finally, in main, create a taobaoSpider object and start scraping

 if __name__ == '__main__':
     taobao_spider = taobaoSpider('taobao_type.json')
     taobao_spider.open_file()
