Scraping the China Land Market Website (中國土地市場網)

During my internship my boss needed land data scraped. After some searching I Googled my way to the China Land Market website (landchina.com), which is probably the most comprehensive source of land data available.

GitHub repo: https://github.com/AnTi-anti/china_land/tree/master

Target Analysis

The information to extract is what the site shows in its final results table. Unlike the previous post on scraping land data from the Xuzhou Bureau of Natural Resources and Planning, this crawl involves several difficulties.

Page Structure Analysis

We first open the official site and click 土地供應 (Land Supply), then 結果公告 (Result Announcements), which brings us to the results listing page.
Because we need data for 2015-2020, broken down by administrative district, Selenium is the natural tool. The structure is similar to the previous post: first crawl the link behind each 土地坐落 (land location) entry, then scrape each link's detail page.

Difficulties

The first difficulty is the occasional "access denied" 500 response. At first I rotated IPs from a free proxy pool, but with so few usable IPs the crawl was actually slower than without a proxy, so I dropped the idea. When the block hits, the only option is to pause for a while. It never appeared while collecting the result links, only while extracting the detail pages. My workaround was to discard the links already scraped and resume detail extraction from the remaining ones, as sketched below.
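Since the detail script below writes exactly one output line per link, in link order, the simplest resume logic is to count the lines already written and skip that many links. A minimal sketch, assuming the file names used by the scripts later in this post:

import json

# Count the records already written; each detail page produces exactly one line
with open("徐州.json", encoding="utf8") as f:
    all_links = json.load(f)
scraped = 0
try:
    with open("徐州_信息.json", encoding="UTF-8") as f:
        scraped = sum(1 for line in f if line.strip())
except FileNotFoundError:
    pass  # nothing scraped yet
remaining = all_links[scraped:]   # feed these back into the detail-page loop
print("resuming: %d done, %d to go" % (scraped, len(remaining)))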

At first I crawled over my own broadband and LAN and hit this block constantly, but on a Huawei cloud server it simply never happened. Running both at the same time, my local machine would get blocked while the cloud server kept going. I still don't know why; if anyone understands what is going on, please let me know in the comments.
The second difficulty is that a captcha page pops up frequently during the crawl. This is not hard to deal with: if a captcha appears, OCR it and submit the answer; otherwise keep crawling. I use Baidu's general OCR API:

import requests

def img_down_load(img):
    # Fetch an access token for Baidu's general OCR API
    host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=MQE9mLzD9296AQQ7byq40Iud&client_secret=n1ElwPtvGTBua67hyLIPZtp5IGciGGjV'
    response = requests.get(host)
    request_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic"
    # The captcha <img> src is a base64 data URI; keep only the payload after the comma
    img = img.split(",")
    params = {"image": img[1]}
    access_token = response.json()['access_token']
    request_url = request_url + "?access_token=" + access_token
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
    response = requests.post(request_url, data=params, headers=headers)
    if response:
        counts = response.json()['words_result'][0]["words"]
        return counts
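The function takes the captcha <img> element's src attribute, which on this site is a base64 data URI, so no file download is needed. For example, with the captcha image element used in the crawl loops below (driver is the Selenium instance created there):

# src looks like "data:image/jpeg;base64,/9j/4AAQ..."; img_down_load() strips
# the prefix before the comma and posts the base64 payload to Baidu OCR
img_src = driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[1]/td[3]/img").get_attribute("src")
print(img_down_load(img_src))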

The third difficulty is that the site only serves the first 200 pages of any query. If you crawl by paging alone, anything beyond page 200 is unreachable, so we filter by date first: enter a time range in the search form (one year at a time works well), run the query, and only then start paging. A sketch of the idea follows.
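I have not reproduced the date inputs' element ids here; the sketch below follows the same send_keys pattern as the district filter used later, with placeholder ids that you should replace after inspecting the page source:

# Hedged sketch: query one year at a time so no slice exceeds 200 pages.
# The two input ids are placeholders, not the site's real ids.
start_box = driver.find_element_by_id("TAB_queryDateItem_start")  # hypothetical id
start_box.clear()
start_box.send_keys("2015-01-01")
end_box = driver.find_element_by_id("TAB_queryDateItem_end")      # hypothetical id
end_box.clear()
end_box.send_keys("2015-12-31")
driver.find_element_by_id("TAB_QueryButtonControl").click()       # the query button used below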
The fourth difficulty is selecting the administrative region. Inspecting the page source shows that you can filter by entering the first four digits of a city's administrative division code (the same digits that begin local ID-card numbers); for Xuzhou that is 3203. A reference table of Jiangsu codes follows below.
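For reference, the first four digits of each prefecture-level city's division code in Jiangsu; only 3203 (Xuzhou) is used in the scripts below, but the table would drive a city-by-city loop:

# First four digits of each Jiangsu city's administrative division code (GB/T 2260)
JIANGSU_CODES = {
    "南京": "3201", "無錫": "3202", "徐州": "3203", "常州": "3204",
    "蘇州": "3205", "南通": "3206", "連雲港": "3207", "淮安": "3208",
    "鹽城": "3209", "揚州": "3210", "鎮江": "3211", "泰州": "3212",
    "宿遷": "3213",
}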

Crawling the Data

Step one, as usual: crawl the links.

import time,json,random
from test import img_down_load   # the Baidu OCR helper above, saved as test.py
from selenium import webdriver
opt = webdriver.ChromeOptions()
# Spoof a desktop user agent (pass the string itself, not a headers dict)
user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
opt.add_argument('--user-agent=%s' % user_agent)
#opt.add_argument("--proxy-server=http://202.20.16.82:10152")
driver = webdriver.Chrome(options=opt)
driver.get("https://www.landchina.com/default.aspx?tabid=263&ComName=default")
time.sleep(random.randint(2,5))
while True:
    # Solve the "click to continue" captcha interstitial if it shows up
    try:
        if driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[2]/td/input").get_attribute("value") == "點擊繼續訪問網站":
            img = driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[1]/td[3]/img").get_attribute("src")
            a = img_down_load(img)
            driver.find_element_by_xpath("//*[@id='intext']").send_keys(a)
            driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[2]/td/input").click()
            time.sleep(random.randint(2,5))
    except:
        pass
    try:
        # Filter by administrative region: 3203 is Xuzhou's division-code prefix
        driver.find_element_by_id("TAB_QueryConditionItem256").click()
        driver.execute_script("document.getElementById('TAB_queryTblEnumItem_256_v').setAttribute('type', 'text');")
        driver.find_element_by_id('TAB_queryTblEnumItem_256_v').clear()
        driver.find_element_by_id("TAB_queryTblEnumItem_256_v").send_keys('3203')
        driver.find_element_by_id("TAB_QueryButtonControl").click()
        break
    except:
        driver.refresh()
        continue
list_info = []

time.sleep(random.randint(2,5))
# The captcha may reappear right after the query is submitted; solve it and re-apply the filter
try:
    if driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[2]/td/input").get_attribute("value") == "點擊繼續訪問網站":
        img = driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[1]/td[3]/img").get_attribute("src")
        a = img_down_load(img)
        driver.find_element_by_xpath("//*[@id='intext']").send_keys(a)
        driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[2]/td/input").click()
        driver.find_element_by_id("TAB_QueryConditionItem256").click()
        driver.execute_script("document.getElementById('TAB_queryTblEnumItem_256_v').setAttribute('type', 'text');")
        driver.find_element_by_id('TAB_queryTblEnumItem_256_v').clear()
        driver.find_element_by_id("TAB_queryTblEnumItem_256_v").send_keys('3203')

        driver.find_element_by_id("TAB_QueryButtonControl").click()
except:
    pass


num = 183                      # number of result pages for this query
pages = 1
while pages < num:             # while instead of range() so the retry below can extend the crawl
    for i in range(2, 32):     # rows 2-31 of the results table
        # Solve the captcha if it interrupts paging
        try:
            if driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[2]/td/input").get_attribute("value") == "點擊繼續訪問網站":
                img = driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[1]/td[3]/img").get_attribute("src")
                a = img_down_load(img)
                driver.find_element_by_xpath("//*[@id='intext']").send_keys(a)
                driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[2]/td/input").click()
        except:
            pass

        try:
            urls = driver.find_element_by_xpath("//*[@id='TAB_contentTable']/tbody/tr[%d]/td[3]/a" % i).get_attribute("href")
        except:
            driver.refresh()
            num += 1           # a row failed: allow one extra page pass to make up for it
            continue
        print(urls)
        list_info.append(urls)
    try:
        # Click the "next page" link
        driver.find_element_by_xpath("//*[@id='mainModuleContainer_485_1113_1539_tdExtendProContainer']/table/tbody/tr[1]/td/table/tbody/tr[2]/td/div/table/tbody/tr/td[2]/a[12]").click()
    except:
        pass
    pages += 1
    time.sleep(random.randint(2,5))
    # Checkpoint the collected links after every page
    with open("徐州.json", "w", encoding="utf8") as f:
        json.dump(list_info, f, indent=1)

driver.quit()

We save the crawled links to a JSON file. Step two is to visit each link and extract its detail page.

import time,json,random
from test import img_down_load
from selenium import webdriver

with open("徐州.json", "r", encoding="utf8") as f:
    urls = json.load(f)
opt = webdriver.ChromeOptions()
# Same user-agent spoofing as in the link crawler
user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
opt.add_argument('--user-agent=%s' % user_agent)
driver = webdriver.Chrome(options=opt)
for url in urls:
    print(url)
    driver.get(url)
    time.sleep(random.randint(2,5))
    tudi_dict = {}
    tudi_list = []
    while True:
        # Solve the captcha interstitial if it appears on the detail page
        try:
            if driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[2]/td/input").get_attribute("value") == "點擊繼續訪問網站":
                img = driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[1]/td[3]/img").get_attribute("src")
                a = img_down_load(img)
                driver.find_element_by_xpath("//*[@id='intext']").send_keys(a)
                driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[2]/td/input").click()
                time.sleep(random.randint(2,5))
        except:
            print("No captcha")


        print("開始查找")
        try:
            xingzhengqu_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r1_c1_ctrl']").get_attribute("textContent")
            try:
                xingzhengqu_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r1_c2_ctrl']").get_attribute("textContent")
            except:
                xingzhengqu_value = ""
            print(xingzhengqu_key, xingzhengqu_value)

            xiangmu_name_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r17_c1_ctrl']").get_attribute("textContent")
            try:
                xiangmu_name_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r17_c2_ctrl']").get_attribute("textContent")
            except:
                xiangmu_name_value = ""
            print(xiangmu_name_key, xiangmu_name_value)

            xiangmu_weizhi_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r16_c1_ctrl']").get_attribute("textContent")
            try:
                xiangmu_weizhi_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r16_c2_ctrl']").get_attribute("textContent")
            except:
                xiangmu_weizhi_value = ""
            print(xiangmu_weizhi_key, xiangmu_weizhi_value)

            mianji_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r2_c1_ctrl']").get_attribute("textContent")
            try:
                mianji_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r2_c2_ctrl']").get_attribute("textContent")
            except:
                mianji_value = ""
            print(mianji_key, mianji_value)

            yongtu_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r3_c1_ctrl']").get_attribute("textContent")
            try:
                yongtu_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r3_c2_ctrl']").get_attribute("textContent")
            except:
                yongtu_value = ""
            print(yongtu_key, yongtu_value)

            fangshi_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r3_c3_ctrl']").get_attribute("textContent")
            try:
                fangshi_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r3_c4_ctrl']").get_attribute("textContent")
            except:
                fangshi_value = ""
            print(fangshi_key, fangshi_value)

            years_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r19_c1_ctrl']").get_attribute("textContent")
            try:
                years_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r19_c2_ctrl']").get_attribute("textContent")
            except:
                years_value = ""
            print(years_key, years_value)

            hangye_type_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r19_c3_ctrl']").get_attribute("textContent")
            try:
                hangye_type_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r19_c4_ctrl']").get_attribute("textContent")
            except:
                hangye_type_value = ""
            print(hangye_type_key, hangye_type_value)

            pice_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r20_c3_ctrl']").get_attribute("textContent")
            try:
                pice_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r20_c4_ctrl']").get_attribute("textContent")
            except:
                pice_value = ""
            print(pice_key, pice_value)

            shiyong_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r9_c1_ctrl']").get_attribute("textContent")
            try:
                shiyong_value_1 = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r9_c2_ctrl']").get_attribute("textContent")
            except:
                shiyong_value_1 = ""

            try:
                shiyong_value_2 = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r23_c2_ctrl']").get_attribute("textContent")
            except:
                shiyong_value_2 = ""


            print(shiyong_key, shiyong_value_1+shiyong_value_2)

            rongjilv_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r21_c1_ctrl']").get_attribute("textContent")
            try:
                rongjilv_next_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f2_r1_c2_ctrl']").get_attribute("textContent")
            except:
                rongjilv_next_value = ""

            try:
                rongjilv_up_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f2_r1_c4_ctrl']").get_attribute("textContent")
            except:
                rongjilv_up_value = ""
            print(rongjilv_key, rongjilv_next_value, rongjilv_up_value)

            riqi_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r14_c3_ctrl']").get_attribute("textContent")
            try:
                riqi_value = driver.find_element_by_xpath(
                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r14_c4_ctrl']").get_attribute("textContent")
            except:
                riqi_value = ""
            print(riqi_key, riqi_value)

            gongkai_time_key = driver.find_element_by_xpath(

                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r22_c1_ctrl']").get_attribute("textContent")
            try:
                gongkai_time_value = driver.find_element_by_xpath("//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r22_c2_ctrl']").get_attribute("textContent")
            except:
                gongkai_time_value = ""
            print(gongkai_time_key,gongkai_time_value)

            suogong_time_key = driver.find_element_by_xpath(
                "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r22_c3_ctrl']").get_attribute("textContent")
            try:
                suogong_time_value = driver.find_element_by_xpath(

                    "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_f1_r22_c4_ctrl']").get_attribute("textContent")
            except:
                suogong_time_value = ""

            print(suogong_time_key, suogong_time_value)
            tudi_dict.update({xingzhengqu_key: xingzhengqu_value, xiangmu_name_key: xiangmu_name_value,
                              xiangmu_weizhi_key: xiangmu_weizhi_value, mianji_key: mianji_value,
                              yongtu_key: yongtu_value,
                              fangshi_key: fangshi_value, years_key: years_value, hangye_type_key: hangye_type_value,
                              pice_key: pice_value, shiyong_key: shiyong_value_1 + "/" + shiyong_value_2,
                              rongjilv_key: rongjilv_next_value + "/" + rongjilv_up_value, riqi_key: riqi_value,
                              gongkai_time_key: gongkai_time_value, suogong_time_key: suogong_time_value})
            tudi_list.append(tudi_dict)
            break
        except:
            driver.refresh()


    print(tudi_list)
    # Append this record as one JSON object per line (trailing comma included)
    with open('徐州_信息.json', 'a', encoding="UTF-8") as f:
        for data in tudi_list:
            res = json.dumps(data, ensure_ascii=False) + ',\n'
            f.write(res)

driver.quit()

The results, too, end up in a JSON file.
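Because the crawler appends each record as a JSON object followed by a comma, the file is not valid JSON as a whole. A small loading sketch (assuming pandas is available) that wraps the lines into an array for analysis:

import json
import pandas as pd

# Each crawled record is one JSON object per line ending in ",\n";
# wrap the whole file in brackets to parse it as a JSON array
with open("徐州_信息.json", encoding="UTF-8") as f:
    text = f.read().rstrip().rstrip(",")
records = json.loads("[" + text + "]")
df = pd.DataFrame(records)
print(df.head())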
So far I have crawled data for six cities in Jiangsu. If you are interested in the source code and the data, head over to my GitHub:
GitHub repo: https://github.com/AnTi-anti/china_land/tree/master
