Table of contents

0. Project overview
1. Building the browser instance
2. Logging in to an Instagram account
3. Scraping the links of all posts on a profile page
4. Scraping and downloading the information, pictures, and videos of a post page
5. Complete code
0. Project overview

Given the link to an Instagram profile page, this project outputs, for every post on that account: ① a text file containing the current time, the time the post was uploaded (in the local time zone), the username, the full name, the post text, the number of likes, the number of comments, the picture descriptions and picture links (when the post contains pictures), and the video view count and video link (when the post contains a video); ② the pictures (when the post contains pictures) and the videos (when the post contains a video).

The project first imports the following libraries:
from selenium import webdriver
from bs4 import BeautifulSoup
import json, time, os
The basic structure of the project is as follows:
def Main():
    profileUrl = input('Please input the instagram profile link: ')
    urlList = PROFILE().Main(profileUrl)
    for url in urlList:
        POST().Main(url)
Explanation: after the browser instance has been built and an Instagram account has been logged in, the link to the profile page is read in as profileUrl. PROFILE().Main(profileUrl) scrapes the links of all posts on the profile page and produces the link list urlList, and POST().Main(url) downloads the information, pictures, and videos of each post.
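The flow of Main() above can be sketched without a browser. The two stub classes here are hypothetical stand-ins for the real PROFILE and POST classes, just to show how the link list produced by the first stage drives the second stage:

```python
# Browser-free sketch of the pipeline in Main(): the profile stage yields
# post links, and the post stage handles each link in turn.
class StubProfile:
    def Main(self, profileUrl):
        # Pretend the profile page contains two posts.
        return [profileUrl + 'p/AAA/', profileUrl + 'p/BBB/']

class StubPost:
    handled = []
    def Main(self, url):
        # Pretend to download the post; just record the URL.
        StubPost.handled.append(url)

urlList = StubProfile().Main('https://www.instagram.com/someuser/')
for url in urlList:
    StubPost().Main(url)
print(StubPost.handled)
```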
1. Building the browser instance
from selenium import webdriver
socksPort = None  # fill in the port of your SOCKS proxy
httpPort = None  # fill in the port of your HTTP proxy
sslPort = None  # fill in the port of your SSL proxy
fxProfile = webdriver.firefox.firefox_profile.FirefoxProfile()
fxProfile.set_preference('network.proxy.type', 1)
fxProfile.set_preference('network.proxy.socks', '127.0.0.1')
fxProfile.set_preference('network.proxy.socks_port', socksPort)
fxProfile.set_preference('network.proxy.http', '127.0.0.1')
fxProfile.set_preference('network.proxy.http_port', httpPort)
fxProfile.set_preference('network.proxy.ssl', '127.0.0.1')
fxProfile.set_preference('network.proxy.ssl_port', sslPort)
fxProfile.set_preference('network.proxy.socks_remote_dns', True)
fxProfile.set_preference('network.trr.mode', 2)
fxProfile.set_preference('permissions.default.image', 2)
fxProfile.set_preference('intl.accept_languages', 'zh-CN, zh, zh-TW, zh-HK, en-US, en')
fxBinaryPath = ''
geckodriverPath = ''
fxDriver = webdriver.firefox.webdriver.WebDriver(firefox_profile=fxProfile, firefox_binary=fxBinaryPath, executable_path=geckodriverPath)
Explanation: you can refer to this blog post. Set the SOCKS, HTTP, and SSL proxy ports socksPort, httpPort, and sslPort according to the tools you have at hand. fxBinaryPath is the absolute path of the Firefox browser's firefox.exe, and geckodriverPath is the absolute path of geckodriver.exe.
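The proxy preferences above can also be collected in one dict and applied in a loop, which makes it easier to see the whole configuration at a glance. This is a minimal sketch; the port numbers are placeholders, not real values:

```python
# Hypothetical local proxy ports -- replace with your own.
socksPort, httpPort, sslPort = 1080, 8080, 8080

# All Firefox proxy preferences in one place.
proxyPrefs = {
    'network.proxy.type': 1,            # 1 = manual proxy configuration
    'network.proxy.socks': '127.0.0.1',
    'network.proxy.socks_port': socksPort,
    'network.proxy.http': '127.0.0.1',
    'network.proxy.http_port': httpPort,
    'network.proxy.ssl': '127.0.0.1',
    'network.proxy.ssl_port': sslPort,
    'network.proxy.socks_remote_dns': True,
}

# With a real FirefoxProfile you would then loop:
# for key, value in proxyPrefs.items():
#     fxProfile.set_preference(key, value)
```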
2. Logging in to an Instagram account
account = ''
password = ''
fxDriver.get('https://www.instagram.com/accounts/login/')
webdriver.support.ui.WebDriverWait(fxDriver, 100).until(lambda x: x.find_element_by_name('username'))
fxDriver.find_element_by_name('username').send_keys(account)
fxDriver.find_element_by_name('password').send_keys(password)
fxDriver.find_element_by_xpath('//button[@type="submit"]').click()
Explanation: account is your Instagram username and password is your Instagram password.
3. Scraping the links of all posts on a profile page

The basic structure of this step is as follows:
def Main(self, profileUrl):
    try:
        fxDriver.get(profileUrl)
        urlList = PROFILE().GetWholePage()
        return urlList
    except Exception as e:
        print(e)
Explanation: the browser first visits the profile page, then PROFILE().GetWholePage() scrapes the links of all posts on the page and produces the link list urlList.

PROFILE().GetWholePage() is as follows:
def GetWholePage(self):
    locY, urlList = PROFILE().GetLocY()
    loadFailCount = 0
    while True:
        pageDownJS = 'document.documentElement.scrollTop=100000000'
        fxDriver.execute_script(pageDownJS)
        while True:
            locYNew, urlListNew = PROFILE().JudgeLoading(locY, urlList)
            if locYNew is None:
                loadFailCount += 1
                if loadFailCount > 20:
                    return urlList
            else:
                loadFailCount = 0
                locY = locYNew
                urlList = urlListNew
                break
Explanation: PROFILE().GetLocY() returns the Y coordinate locY of the tag containing the last post link in the profile page's HTML, together with the list urlList of the links of all posts loaded so far. fxDriver.execute_script(pageDownJS) scrolls the page to the very bottom by executing the JS statement document.documentElement.scrollTop=100000000. PROFILE().JudgeLoading(locY, urlList) compares the given Y coordinate with the Y coordinate measured 0.5 seconds later to decide whether the scroll triggered by fxDriver.execute_script(pageDownJS) has finished loading new content: it returns None if it has not, and the new Y coordinate locYNew and the new link list urlListNew if it has.
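The scroll-and-retry logic of GetWholePage can be simulated without a browser. FakePage below is a hypothetical stand-in for the real Instagram page: each "scroll" reveals one batch of posts, and collect_all keeps scrolling until a number of consecutive checks report no growth:

```python
# Browser-free simulation of the GetWholePage retry loop.
class FakePage:
    def __init__(self, batches):
        self.batches = batches  # lists of post URLs revealed per scroll
        self.loaded = []

    def scroll(self):
        if self.batches:
            self.loaded += self.batches.pop(0)

    def state(self):
        # Mirrors GetLocY: (Y coordinate proxy, all loaded URLs so far)
        return len(self.loaded), list(self.loaded)


def collect_all(page, maxFails=20):
    locY, urlList = page.state()
    loadFailCount = 0
    while True:
        page.scroll()
        locYNew, urlListNew = page.state()
        if locYNew <= locY:        # nothing new loaded: count a failure
            loadFailCount += 1
            if loadFailCount > maxFails:
                return urlList
        else:                      # new content arrived: reset the counter
            loadFailCount = 0
            locY, urlList = locYNew, urlListNew


page = FakePage([['/p/a/'], ['/p/b/', '/p/c/']])
print(collect_all(page, maxFails=3))  # ['/p/a/', '/p/b/', '/p/c/']
```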
PROFILE().GetLocY() is as follows:
def GetLocY(self):
    urlList = []
    locY = 0  # default when no post link has been loaded yet
    for e in fxDriver.find_elements_by_tag_name('a'):
        try:
            url = e.get_attribute('href')
            if '/p/' in url:
                locY = e.location['y']
                urlList.append(url)
        except:
            continue
    return locY, urlList
Explanation: the loop checks whether '/p/' appears in the 'href' attribute of each a tag to collect the post links and the Y coordinate of the tag containing the last of them.
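The '/p/' filter itself can be illustrated on a plain list of hrefs instead of live a elements. The URLs here are hypothetical examples; only post pages contain the '/p/' path segment:

```python
# Only hrefs containing '/p/' are post links; everything else is
# navigation (login page, profile page, etc.).
hrefs = [
    'https://www.instagram.com/accounts/login/',
    'https://www.instagram.com/p/ABC123/',
    'https://www.instagram.com/someuser/',
    'https://www.instagram.com/p/XYZ789/',
]
postLinks = [u for u in hrefs if '/p/' in u]
print(postLinks)
```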
PROFILE().JudgeLoading(locY, urlList) is as follows:
def JudgeLoading(self, locY, urlList):
    time.sleep(0.5)
    locYNew, urlListNew = PROFILE().GetLocY()
    if locY < locYNew:
        urlListNew += urlList
        urlListNew = list(set(urlListNew))
        return locYNew, urlListNew
    else:
        return None, None
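The merge-and-dedupe step in JudgeLoading can be shown in isolation: the newly scraped links are combined with the previously collected ones, and set() removes the overlap between scrolls. The URLs are hypothetical examples:

```python
# Links collected before the scroll and after it overlap on /p/B/;
# set() keeps each link once.
urlList = ['https://www.instagram.com/p/A/', 'https://www.instagram.com/p/B/']
urlListNew = ['https://www.instagram.com/p/B/', 'https://www.instagram.com/p/C/']
merged = list(set(urlListNew + urlList))
print(sorted(merged))
```

Note that set() discards ordering, so the final urlList in GetWholePage is not guaranteed to be in page order.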
The modules above are integrated into a class as follows:
class PROFILE(object):
    def GetLocY(self):
        urlList = []
        locY = 0  # default when no post link has been loaded yet
        for e in fxDriver.find_elements_by_tag_name('a'):
            try:
                url = e.get_attribute('href')
                if '/p/' in url:
                    locY = e.location['y']
                    urlList.append(url)
            except:
                continue
        return locY, urlList

    def JudgeLoading(self, locY, urlList):
        time.sleep(0.5)
        locYNew, urlListNew = PROFILE().GetLocY()
        if locY < locYNew:
            urlListNew += urlList
            urlListNew = list(set(urlListNew))
            return locYNew, urlListNew
        else:
            return None, None

    def GetWholePage(self):
        locY, urlList = PROFILE().GetLocY()
        loadFailCount = 0
        while True:
            pageDownJS = 'document.documentElement.scrollTop=100000000'
            fxDriver.execute_script(pageDownJS)
            while True:
                locYNew, urlListNew = PROFILE().JudgeLoading(locY, urlList)
                if locYNew is None:
                    loadFailCount += 1
                    if loadFailCount > 20:
                        return urlList
                else:
                    loadFailCount = 0
                    locY = locYNew
                    urlList = urlListNew
                    break

    def Main(self, profileUrl):
        try:
            fxDriver.get(profileUrl)
            urlList = PROFILE().GetWholePage()
            return urlList
        except Exception as e:
            print(e)
Explanation: calling PROFILE().Main(profileUrl) returns the links of all posts on the profile page.
4. Scraping and downloading the information, pictures, and videos of a post page

The basic structure of this step is as follows:
def Main(self, url):
    try:
        fxDriver.get(url)
        html = fxDriver.page_source
        info = POST().GetInfo(html)
        POST().DownloadInfo(info)
        POST().DownloadFile(info)
    except Exception as e:
        print(e)
Explanation: the browser first visits the post page; fxDriver.page_source then retrieves the post page's HTML. POST().GetInfo(html) parses the HTML to obtain the time the post was uploaded (in the local time zone), the username, the full name, the post text, the number of likes, the number of comments, the picture descriptions and picture links (when the post contains pictures), and the video view count and video link (when the post contains a video). POST().DownloadInfo(info) writes this information to a text file, and POST().DownloadFile(info) downloads the pictures and videos of the post page according to the information obtained.

POST().GetInfo(html) is as follows:
def GetInfo(self, html):
    soup = BeautifulSoup(html, 'html.parser')
    for s in soup.find_all('script', {'type': 'text/javascript'}):
        if s.string is not None and 'graphql' in s.string:
            jsonData = json.loads(s.string[s.string.find('{'): s.string.rfind('}') + 1])
            break
    uploadTimeStamp = jsonData['graphql']['shortcode_media']['taken_at_timestamp']
    uploadTime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(uploadTimeStamp))
    username = jsonData['graphql']['shortcode_media']['owner']['username']
    fullName = jsonData['graphql']['shortcode_media']['owner']['full_name']
    likes = jsonData['graphql']['shortcode_media']['edge_media_preview_like']['count']
    comments = jsonData['graphql']['shortcode_media']['edge_media_preview_comment']['count']
    try:
        text = jsonData['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
    except:
        text = 'None'
    try:
        displayDict = {}
        for obj in jsonData['graphql']['shortcode_media']['edge_sidecar_to_children']['edges']:
            displayUrl = obj['node']['display_url']
            picDescription = obj['node']['accessibility_caption']
            displayDict[displayUrl] = picDescription
        return uploadTime, username, fullName, likes, comments, text, displayDict, 'ps'
    except:
        try:
            videoUrl = jsonData['graphql']['shortcode_media']['video_url']
            videoViewCount = jsonData['graphql']['shortcode_media']['video_view_count']
            return uploadTime, username, fullName, likes, comments, text, videoUrl, videoViewCount, 'v'
        except:
            displayUrl = jsonData['graphql']['shortcode_media']['display_url']
            picDescription = jsonData['graphql']['shortcode_media']['accessibility_caption']
            return uploadTime, username, fullName, likes, comments, text, displayUrl, picDescription, 'p'
Explanation: all the information we need from the post is contained in jsonData. We locate jsonData by checking whether 'graphql' appears in a script tag whose 'type' attribute is 'text/javascript'.
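The extraction itself is a string slice: take the substring from the first '{' to the last '}' of the script body and parse it with json.loads. The script text below is a simplified, hypothetical stand-in for Instagram's real markup:

```python
import json

# Simplified stand-in for a <script> body carrying the post's JSON payload.
scriptText = 'window._sharedData = {"graphql": {"shortcode_media": {"taken_at_timestamp": 1600000000}}};'

# Slice from the first '{' to the last '}' (inclusive) and parse.
jsonData = json.loads(scriptText[scriptText.find('{'): scriptText.rfind('}') + 1])
print(jsonData['graphql']['shortcode_media']['taken_at_timestamp'])  # 1600000000
```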
POST().DownloadInfo(info) is as follows:
def DownloadInfo(self, info):
    now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(int(time.time())))
    uploadTime = info[0]
    username = info[1]
    fullName = info[2]
    likes = info[3]
    comments = info[4]
    text = info[5]
    folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
    try:
        os.makedirs(folder)
    except Exception as e:
        print(e)
    with open('{}\\info.txt'.format(folder), 'w', encoding='utf-8') as f:
        f.write('Now: {}'.format(now))
        f.write('\nUpload time: {}'.format(uploadTime))
        f.write('\nUsername: {}'.format(username))
        f.write('\nFull name: {}'.format(fullName))
        f.write('\nText: {}'.format(text))
        f.write('\nLikes: {}'.format(likes))
        f.write('\nComments: {}'.format(comments))
        if info[-1] == 'ps':
            displayDict = info[6]
            picIdx = 1
            for displayUrl, picDescription in displayDict.items():
                f.write('\nPicture {} description: {}'.format(str(picIdx), picDescription))
                f.write('\nPicture {} url: {}'.format(str(picIdx), displayUrl))
                picIdx += 1
        elif info[-1] == 'v':
            videoUrl = info[6]
            videoViewCount = info[7]
            f.write('\nVideo view count: {}'.format(videoViewCount))
            f.write('\nVideo url: {}'.format(videoUrl))
        elif info[-1] == 'p':
            displayUrl = info[6]
            picDescription = info[7]
            f.write('\nPicture description: {}'.format(picDescription))
            f.write('\nPicture url: {}'.format(displayUrl))
Explanation: outputPath is a global variable holding the absolute path of the output folder.
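The per-post folder name is just the upload time with its separators stripped, so timestamps sort naturally as directory names. For example:

```python
# '2020-01-02 03:04:05' -> '20200102030405': drop '-', ':' and ' '.
uploadTime = '2020-01-02 03:04:05'
folderName = uploadTime.replace('-', '').replace(':', '').replace(' ', '')
print(folderName)  # 20200102030405
```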
POST().DownloadFile(info) is as follows:
def DownloadFile(self, info):
    uploadTime = info[0]
    username = info[1]
    folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
    if info[-1] == 'ps':
        displayDict = info[6]
        i = 1
        for displayUrl in displayDict.keys():
            os.system('{} --output-document={}\\{}.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, str(i), httpProxy, httpsProxy, displayUrl))
            i += 1
    elif info[-1] == 'v':
        videoUrl = info[6]
        os.system('{} --output-document={}\\1.mp4 --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, videoUrl))
    elif info[-1] == 'p':
        displayUrl = info[6]
        os.system('{} --output-document={}\\1.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, displayUrl))
Explanation: you can refer to this blog post. wgetPath is a global variable holding the absolute path of wget.exe; httpProxy and httpsProxy are global variables holding the HTTP and HTTPS proxies.
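The wget command that DownloadFile builds can be inspected as a plain string before running it through os.system. Every path and proxy value below is a hypothetical placeholder:

```python
# Hypothetical placeholders -- replace with your real paths and proxies.
wgetPath = r'C:\tools\wget.exe'
folder = r'C:\output\someuser\20200102030405'
httpProxy = 'http://127.0.0.1:8080/'
httpsProxy = 'https://127.0.0.1:8080/'
displayUrl = 'https://example.com/pic.jpg'

# Same template as the 'p' branch of DownloadFile.
cmd = ('{} --output-document={}\\1.png --no-check-certificate '
       '--execute http_proxy={} --execute https_proxy={} '
       '--execute robots=off --continue "{}"').format(
    wgetPath, folder, httpProxy, httpsProxy, displayUrl)
print(cmd)
```

Quoting the URL matters because Instagram media links contain '&' characters that the shell would otherwise interpret.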
The modules above are integrated into a class as follows:
class POST(object):
    def GetInfo(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for s in soup.find_all('script', {'type': 'text/javascript'}):
            if s.string is not None and 'graphql' in s.string:
                jsonData = json.loads(s.string[s.string.find('{'): s.string.rfind('}') + 1])
                break
        uploadTimeStamp = jsonData['graphql']['shortcode_media']['taken_at_timestamp']
        uploadTime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(uploadTimeStamp))
        username = jsonData['graphql']['shortcode_media']['owner']['username']
        fullName = jsonData['graphql']['shortcode_media']['owner']['full_name']
        likes = jsonData['graphql']['shortcode_media']['edge_media_preview_like']['count']
        comments = jsonData['graphql']['shortcode_media']['edge_media_preview_comment']['count']
        try:
            text = jsonData['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
        except:
            text = 'None'
        try:
            displayDict = {}
            for obj in jsonData['graphql']['shortcode_media']['edge_sidecar_to_children']['edges']:
                displayUrl = obj['node']['display_url']
                picDescription = obj['node']['accessibility_caption']
                displayDict[displayUrl] = picDescription
            return uploadTime, username, fullName, likes, comments, text, displayDict, 'ps'
        except:
            try:
                videoUrl = jsonData['graphql']['shortcode_media']['video_url']
                videoViewCount = jsonData['graphql']['shortcode_media']['video_view_count']
                return uploadTime, username, fullName, likes, comments, text, videoUrl, videoViewCount, 'v'
            except:
                displayUrl = jsonData['graphql']['shortcode_media']['display_url']
                picDescription = jsonData['graphql']['shortcode_media']['accessibility_caption']
                return uploadTime, username, fullName, likes, comments, text, displayUrl, picDescription, 'p'

    def DownloadInfo(self, info):
        now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(int(time.time())))
        uploadTime = info[0]
        username = info[1]
        fullName = info[2]
        likes = info[3]
        comments = info[4]
        text = info[5]
        folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
        try:
            os.makedirs(folder)
        except Exception as e:
            print(e)
        with open('{}\\info.txt'.format(folder), 'w', encoding='utf-8') as f:
            f.write('Now: {}'.format(now))
            f.write('\nUpload time: {}'.format(uploadTime))
            f.write('\nUsername: {}'.format(username))
            f.write('\nFull name: {}'.format(fullName))
            f.write('\nText: {}'.format(text))
            f.write('\nLikes: {}'.format(likes))
            f.write('\nComments: {}'.format(comments))
            if info[-1] == 'ps':
                displayDict = info[6]
                picIdx = 1
                for displayUrl, picDescription in displayDict.items():
                    f.write('\nPicture {} description: {}'.format(str(picIdx), picDescription))
                    f.write('\nPicture {} url: {}'.format(str(picIdx), displayUrl))
                    picIdx += 1
            elif info[-1] == 'v':
                videoUrl = info[6]
                videoViewCount = info[7]
                f.write('\nVideo view count: {}'.format(videoViewCount))
                f.write('\nVideo url: {}'.format(videoUrl))
            elif info[-1] == 'p':
                displayUrl = info[6]
                picDescription = info[7]
                f.write('\nPicture description: {}'.format(picDescription))
                f.write('\nPicture url: {}'.format(displayUrl))

    def DownloadFile(self, info):
        uploadTime = info[0]
        username = info[1]
        folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
        if info[-1] == 'ps':
            displayDict = info[6]
            i = 1
            for displayUrl in displayDict.keys():
                os.system('{} --output-document={}\\{}.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, str(i), httpProxy, httpsProxy, displayUrl))
                i += 1
        elif info[-1] == 'v':
            videoUrl = info[6]
            os.system('{} --output-document={}\\1.mp4 --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, videoUrl))
        elif info[-1] == 'p':
            displayUrl = info[6]
            os.system('{} --output-document={}\\1.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, displayUrl))

    def Main(self, url):
        try:
            fxDriver.get(url)
            html = fxDriver.page_source
            info = POST().GetInfo(html)
            POST().DownloadInfo(info)
            POST().DownloadFile(info)
        except Exception as e:
            print(e)
Explanation: calling POST().Main(url) downloads the information, pictures, and videos of each post.
5. Complete code
from selenium import webdriver
from bs4 import BeautifulSoup
import json, time, os

socksPort = None  # fill in the port of your SOCKS proxy
httpPort = None  # fill in the port of your HTTP proxy
sslPort = None  # fill in the port of your SSL proxy
fxProfile = webdriver.firefox.firefox_profile.FirefoxProfile()
fxProfile.set_preference('network.proxy.type', 1)
fxProfile.set_preference('network.proxy.socks', '127.0.0.1')
fxProfile.set_preference('network.proxy.socks_port', socksPort)
fxProfile.set_preference('network.proxy.http', '127.0.0.1')
fxProfile.set_preference('network.proxy.http_port', httpPort)
fxProfile.set_preference('network.proxy.ssl', '127.0.0.1')
fxProfile.set_preference('network.proxy.ssl_port', sslPort)
fxProfile.set_preference('network.proxy.socks_remote_dns', True)
fxProfile.set_preference('network.trr.mode', 2)
fxProfile.set_preference('permissions.default.image', 2)
fxProfile.set_preference('intl.accept_languages', 'zh-CN, zh, zh-TW, zh-HK, en-US, en')
fxBinaryPath = ''
geckodriverPath = ''
fxDriver = webdriver.firefox.webdriver.WebDriver(firefox_profile=fxProfile, firefox_binary=fxBinaryPath, executable_path=geckodriverPath)

account = ''
password = ''
fxDriver.get('https://www.instagram.com/accounts/login/')
webdriver.support.ui.WebDriverWait(fxDriver, 100).until(lambda x: x.find_element_by_name('username'))
fxDriver.find_element_by_name('username').send_keys(account)
fxDriver.find_element_by_name('password').send_keys(password)
fxDriver.find_element_by_xpath('//button[@type="submit"]').click()

outputPath = ''
wgetPath = ''
httpProxy = 'http://127.0.0.1:{}/'.format(str(httpPort))
httpsProxy = 'https://127.0.0.1:{}/'.format(str(sslPort))

class POST(object):
    def GetInfo(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for s in soup.find_all('script', {'type': 'text/javascript'}):
            if s.string is not None and 'graphql' in s.string:
                jsonData = json.loads(s.string[s.string.find('{'): s.string.rfind('}') + 1])
                break
        uploadTimeStamp = jsonData['graphql']['shortcode_media']['taken_at_timestamp']
        uploadTime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(uploadTimeStamp))
        username = jsonData['graphql']['shortcode_media']['owner']['username']
        fullName = jsonData['graphql']['shortcode_media']['owner']['full_name']
        likes = jsonData['graphql']['shortcode_media']['edge_media_preview_like']['count']
        comments = jsonData['graphql']['shortcode_media']['edge_media_preview_comment']['count']
        try:
            text = jsonData['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
        except:
            text = 'None'
        try:
            displayDict = {}
            for obj in jsonData['graphql']['shortcode_media']['edge_sidecar_to_children']['edges']:
                displayUrl = obj['node']['display_url']
                picDescription = obj['node']['accessibility_caption']
                displayDict[displayUrl] = picDescription
            return uploadTime, username, fullName, likes, comments, text, displayDict, 'ps'
        except:
            try:
                videoUrl = jsonData['graphql']['shortcode_media']['video_url']
                videoViewCount = jsonData['graphql']['shortcode_media']['video_view_count']
                return uploadTime, username, fullName, likes, comments, text, videoUrl, videoViewCount, 'v'
            except:
                displayUrl = jsonData['graphql']['shortcode_media']['display_url']
                picDescription = jsonData['graphql']['shortcode_media']['accessibility_caption']
                return uploadTime, username, fullName, likes, comments, text, displayUrl, picDescription, 'p'

    def DownloadInfo(self, info):
        now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(int(time.time())))
        uploadTime = info[0]
        username = info[1]
        fullName = info[2]
        likes = info[3]
        comments = info[4]
        text = info[5]
        folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
        try:
            os.makedirs(folder)
        except Exception as e:
            print(e)
        with open('{}\\info.txt'.format(folder), 'w', encoding='utf-8') as f:
            f.write('Now: {}'.format(now))
            f.write('\nUpload time: {}'.format(uploadTime))
            f.write('\nUsername: {}'.format(username))
            f.write('\nFull name: {}'.format(fullName))
            f.write('\nText: {}'.format(text))
            f.write('\nLikes: {}'.format(likes))
            f.write('\nComments: {}'.format(comments))
            if info[-1] == 'ps':
                displayDict = info[6]
                picIdx = 1
                for displayUrl, picDescription in displayDict.items():
                    f.write('\nPicture {} description: {}'.format(str(picIdx), picDescription))
                    f.write('\nPicture {} url: {}'.format(str(picIdx), displayUrl))
                    picIdx += 1
            elif info[-1] == 'v':
                videoUrl = info[6]
                videoViewCount = info[7]
                f.write('\nVideo view count: {}'.format(videoViewCount))
                f.write('\nVideo url: {}'.format(videoUrl))
            elif info[-1] == 'p':
                displayUrl = info[6]
                picDescription = info[7]
                f.write('\nPicture description: {}'.format(picDescription))
                f.write('\nPicture url: {}'.format(displayUrl))

    def DownloadFile(self, info):
        uploadTime = info[0]
        username = info[1]
        folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
        if info[-1] == 'ps':
            displayDict = info[6]
            i = 1
            for displayUrl in displayDict.keys():
                os.system('{} --output-document={}\\{}.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, str(i), httpProxy, httpsProxy, displayUrl))
                i += 1
        elif info[-1] == 'v':
            videoUrl = info[6]
            os.system('{} --output-document={}\\1.mp4 --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, videoUrl))
        elif info[-1] == 'p':
            displayUrl = info[6]
            os.system('{} --output-document={}\\1.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, displayUrl))

    def Main(self, url):
        try:
            fxDriver.get(url)
            html = fxDriver.page_source
            info = POST().GetInfo(html)
            POST().DownloadInfo(info)
            POST().DownloadFile(info)
        except Exception as e:
            print(e)

class PROFILE(object):
    def GetLocY(self):
        urlList = []
        locY = 0  # default when no post link has been loaded yet
        for e in fxDriver.find_elements_by_tag_name('a'):
            try:
                url = e.get_attribute('href')
                if '/p/' in url:
                    locY = e.location['y']
                    urlList.append(url)
            except:
                continue
        return locY, urlList

    def JudgeLoading(self, locY, urlList):
        time.sleep(0.5)
        locYNew, urlListNew = PROFILE().GetLocY()
        if locY < locYNew:
            urlListNew += urlList
            urlListNew = list(set(urlListNew))
            return locYNew, urlListNew
        else:
            return None, None

    def GetWholePage(self):
        locY, urlList = PROFILE().GetLocY()
        loadFailCount = 0
        while True:
            pageDownJS = 'document.documentElement.scrollTop=100000000'
            fxDriver.execute_script(pageDownJS)
            while True:
                locYNew, urlListNew = PROFILE().JudgeLoading(locY, urlList)
                if locYNew is None:
                    loadFailCount += 1
                    if loadFailCount > 20:
                        return urlList
                else:
                    loadFailCount = 0
                    locY = locYNew
                    urlList = urlListNew
                    break

    def Main(self, profileUrl):
        try:
            fxDriver.get(profileUrl)
            urlList = PROFILE().GetWholePage()
            return urlList
        except Exception as e:
            print(e)

def Main():
    profileUrl = input('Please input the instagram profile link: ')
    urlList = PROFILE().Main(profileUrl)
    for url in urlList:
        POST().Main(url)

if __name__ == '__main__':
    Main()