During my internship, my boss needed land data scraped. After some searching around, I Googled my way to China Land Market Network (landchina.com), which is probably the most comprehensive source of land data available.
GitHub: https://github.com/AnTi-anti/china_land/tree/master
Target analysis
The information we need is what appears in the final table shown above. Unlike the previous post on scraping land data from the Xuzhou Bureau of Natural Resources and Planning, this crawl involves several tricky points.
Page structure analysis
Starting from the official site, click 土地供應 (land supply), then 結果公告 (result announcements), which brings up the following page.
Since we need the 2015-2020 data, crawled district by district, Selenium is clearly the tool for the job. The approach mirrors the previous post: first collect the links in the parcel-location column, then scrape each link's detail page.
Difficulties
The first problem is the "access denied" 500 response. Initially I rotated IPs from a free proxy pool, but with so few usable IPs the crawl was actually slower than without a proxy, so I abandoned that idea. When this error appears, the only option is to pause for a while. It never occurs while harvesting the list of links, only while extracting detail pages. My workaround was to drop the links already scraped and resume detail extraction on the remaining, unused links.
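The resume step above can be sketched as a set difference. This is a minimal illustration, not part of the original code: it assumes you keep a plain-text log of finished URLs (one per line, a file the original scripts do not create) alongside the saved link list.

```python
import json

def remaining_links(all_links_path, done_log_path):
    # links still to crawl = saved link list minus those already logged as done
    with open(all_links_path, encoding="utf8") as f:
        all_links = json.load(f)
    try:
        with open(done_log_path, encoding="utf8") as f:
            done = set(line.strip() for line in f if line.strip())
    except FileNotFoundError:
        done = set()          # first run: nothing finished yet
    return [u for u in all_links if u not in done]
```

Appending each URL to the log right after its detail page is written means a 500 block or crash only costs the page in flight.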
At first I crawled over my own broadband connection and LAN, and this block came up constantly. Oddly, once I switched to a Huawei cloud server it never appeared. Running both at the same time, my local machine would get blocked while the cloud server kept going. I have no idea why; if anyone who knows networking hardware can explain, please tell me in the comments, since I'm not sure my own guess is right.
The second problem is that captchas pop up frequently during the crawl. This one is not hard: we can recognize them directly (here via Baidu's OCR API). If a captcha appears, recognize it; otherwise, just keep crawling.
# test.py -- captcha recognition via Baidu OCR's general_basic endpoint
import requests

def img_down_load(img):
    # fetch an access token for the Baidu OCR API
    host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=MQE9mLzD9296AQQ7byq40Iud&client_secret=n1ElwPtvGTBua67hyLIPZtp5IGciGGjV'
    response = requests.get(host)
    request_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic"
    # the captcha <img> src is a base64 data URL; keep only the payload after the comma
    img = img.split(",")
    params = {"image": img[1]}
    access_token = response.json()['access_token']
    request_url = request_url + "?access_token=" + access_token
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
    response = requests.post(request_url, data=params, headers=headers)
    if response:
        counts = response.json()['words_result'][0]["words"]
        return counts
The third problem is that only the first 200 pages of results are served. If you simply page through, 200 pages is all you can reach, so we have to filter by date first: enter a date range in the search form above, then start paging.
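One way to stay under the 200-page cap is to split 2015-2020 into shorter date windows and run one query per window. A sketch of the window generator (the 90-day default is an assumption you would tune per city, not something from the original code):

```python
from datetime import date, timedelta

def date_windows(start, end, days=90):
    # yield consecutive, non-overlapping (window_start, window_end) pairs covering [start, end]
    cur = start
    while cur <= end:
        nxt = min(cur + timedelta(days=days - 1), end)
        yield cur, nxt
        cur = nxt + timedelta(days=1)
```

Each `(lo, hi)` pair then goes into the site's date filter before paging: `for lo, hi in date_windows(date(2015, 1, 1), date(2020, 12, 31)): ...`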
The fourth is selecting the administrative region. Inspecting the page source shows that a city can be selected by entering the first four digits of its administrative division code (the same prefix used in ID-card numbers).
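For reference, these are the four-digit prefixes for Jiangsu's prefecture-level cities from the national administrative division code table; "3203" below is what the script sends for Xuzhou. This lookup table is my own addition for convenience, not part of the original code:

```python
# first four digits of the administrative division code -> prefecture-level city
JIANGSU_CITY_CODES = {
    "3201": "南京", "3202": "無錫", "3203": "徐州", "3204": "常州",
    "3205": "蘇州", "3206": "南通", "3207": "連雲港", "3208": "淮安",
    "3209": "鹽城", "3210": "揚州", "3211": "鎮江", "3212": "泰州",
    "3213": "宿遷",
}
```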
Data crawling
Step one, as usual: crawl the links.
import time, json, random
from test import img_down_load
from selenium import webdriver

opt = webdriver.ChromeOptions()
user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
opt.add_argument('--user-agent=%s' % user_agent)
# opt.add_argument("--proxy-server=http://202.20.16.82:10152")
driver = webdriver.Chrome(options=opt)
driver.get("https://www.landchina.com/default.aspx?tabid=263&ComName=default")
time.sleep(random.randint(2, 5))

def solve_captcha():
    # if the anti-crawl interstitial appears, OCR its captcha and click through;
    # returns True when an interstitial was handled
    try:
        btn = driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[2]/td/input")
        if btn.get_attribute("value") == "點擊繼續訪問網站":
            img = driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[1]/td[3]/img").get_attribute("src")
            driver.find_element_by_xpath("//*[@id='intext']").send_keys(img_down_load(img))
            btn.click()
            time.sleep(random.randint(2, 5))
            return True
    except:
        pass
    return False

def set_region():
    # filter by administrative region: first four digits of the city code ("3203" = Xuzhou)
    driver.find_element_by_id("TAB_QueryConditionItem256").click()
    driver.execute_script("document.getElementById('TAB_queryTblEnumItem_256_v').setAttribute('type', 'text');")
    driver.find_element_by_id('TAB_queryTblEnumItem_256_v').clear()
    driver.find_element_by_id("TAB_queryTblEnumItem_256_v").send_keys('3203')
    driver.find_element_by_id("TAB_QueryButtonControl").click()

while True:
    solve_captcha()
    try:
        set_region()
        break
    except:
        driver.refresh()
        continue

list_info = []
time.sleep(random.randint(2, 5))
try:
    if solve_captcha():
        set_region()  # the interstitial resets the query, so re-apply the region filter
except:
    pass

num = 183  # total result pages for this query
pages = 1
while pages < num:
    for i in range(2, 32):  # 30 result rows per page
        solve_captcha()
        try:
            url = driver.find_element_by_xpath("//*[@id='TAB_contentTable']/tbody/tr[%d]/td[3]/a" % i).get_attribute("href")
        except:
            driver.refresh()
            num += 1  # allow one extra page to make up for the rows lost to this refresh
            continue
        print(url)
        list_info.append(url)
    try:
        # click the "next page" link
        driver.find_element_by_xpath("//*[@id='mainModuleContainer_485_1113_1539_tdExtendProContainer']/table/tbody/tr[1]/td/table/tbody/tr[2]/td/div/table/tbody/tr/td[2]/a[12]").click()
    except:
        pass
    time.sleep(random.randint(2, 5))
    pages += 1

with open("徐州.json", "w", encoding="utf8") as f:
    json.dump(list_info, f, indent=1)
driver.quit()
We save the scraped links to a JSON file.
Step two: extract the detail page behind each link.
import time, json, random
from test import img_down_load
from selenium import webdriver

with open("徐州.json", "r", encoding="utf8") as f:
    urls = json.load(f)

opt = webdriver.ChromeOptions()
user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
opt.add_argument('--user-agent=%s' % user_agent)
driver = webdriver.Chrome(options=opt)

# every field sits in a cell whose element id differs only in the f<table>_r<row>_c<col> slot
CELL_XPATH = "//*[@id='mainModuleContainer_1855_1856_ctl00_ctl00_p1_%s_ctrl']"

def cell(slot):
    # a missing label means a broken page load; let it raise so the caller refreshes and retries
    return driver.find_element_by_xpath(CELL_XPATH % slot).get_attribute("textContent")

def cell_or_blank(slot):
    # a missing value just means the field is empty on this notice
    try:
        return cell(slot)
    except:
        return ""

# (label slot, value slot) pairs, in order: administrative district, project name,
# project location, area, land use, supply method, tenure years, industry type,
# price, signing date, supply start time, supply end time
FIELDS = [
    ("f1_r1_c1", "f1_r1_c2"), ("f1_r17_c1", "f1_r17_c2"),
    ("f1_r16_c1", "f1_r16_c2"), ("f1_r2_c1", "f1_r2_c2"),
    ("f1_r3_c1", "f1_r3_c2"), ("f1_r3_c3", "f1_r3_c4"),
    ("f1_r19_c1", "f1_r19_c2"), ("f1_r19_c3", "f1_r19_c4"),
    ("f1_r20_c3", "f1_r20_c4"), ("f1_r14_c3", "f1_r14_c4"),
    ("f1_r22_c1", "f1_r22_c2"), ("f1_r22_c3", "f1_r22_c4"),
]

for url in urls:
    print(url)
    driver.get(url)
    time.sleep(random.randint(2, 5))
    tudi_dict = {}
    tudi_list = []
    while True:
        # pass the anti-crawl interstitial if it appears
        try:
            btn = driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[2]/td/input")
            if btn.get_attribute("value") == "點擊繼續訪問網站":
                img = driver.find_element_by_xpath("/html/body/div/div[2]/table/tbody/tr[1]/td[3]/img").get_attribute("src")
                driver.find_element_by_xpath("//*[@id='intext']").send_keys(img_down_load(img))
                btn.click()
                time.sleep(random.randint(2, 5))
        except:
            print("no captcha")
        print("start extracting")
        try:
            for key_slot, value_slot in FIELDS:
                key = cell(key_slot)
                value = cell_or_blank(value_slot)
                print(key, value)
                tudi_dict[key] = value
            # land user: the value may be split across two cells
            key = cell("f1_r9_c1")
            tudi_dict[key] = cell_or_blank("f1_r9_c2") + "/" + cell_or_blank("f1_r23_c2")
            print(key, tudi_dict[key])
            # plot ratio (FAR): lower and upper bounds live in a second sub-table
            key = cell("f1_r21_c1")
            tudi_dict[key] = cell_or_blank("f2_r1_c2") + "/" + cell_or_blank("f2_r1_c4")
            print(key, tudi_dict[key])
            tudi_list.append(tudi_dict)
            break
        except:
            driver.refresh()
    print(tudi_list)
    with open('徐州_信息.json', 'a', encoding="UTF-8") as f:
        for data in tudi_list:
            f.write(json.dumps(data, ensure_ascii=False) + ',\n')
driver.quit()
The final results are likewise saved to a JSON file.
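One caveat: because each record is appended as a JSON object followed by ",\n", the output file is not a single valid JSON document, so `json.load` on the whole file will fail. A small loader sketch for reading it back (my own helper, not part of the original scripts):

```python
import json

def load_records(path):
    # each line holds one JSON object with a trailing comma; strip it and parse line by line
    records = []
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line:
                records.append(json.loads(line))
    return records
```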
So far I have scraped data for six cities in Jiangsu. If you are interested in the source code and data, you can find them on my GitHub.
GitHub: https://github.com/AnTi-anti/china_land/tree/master