[Python] Scraping and Downloading the Information, Pictures, and Videos of All Posts in an Instagram Account

Contents
0. Project Introduction
1. Constructing the Browser Instance
2. Logging in to an Instagram Account
3. Scraping the Links of All Posts on the Profile Page
4. Scraping and Downloading the Information, Pictures, and Videos on Each Post Page
5. Complete Code

0. Project Introduction

The goal of this project is to take the link of a specified Instagram profile page as input and, for every post in that account, output: (1) a text file containing the current time, the time the post was uploaded (in the local time zone), the username, the full name, the post text, the number of likes, the number of comments, the picture descriptions (when the post contains pictures), the picture links (when the post contains pictures), the video view count (when the post contains a video), and the video link (when the post contains a video); (2) the pictures (when the post contains pictures) and the videos (when the post contains a video).


This project first requires importing the following libraries:

from selenium import webdriver
from bs4 import BeautifulSoup
import json, time, os

The basic structure of the project is as follows:

def Main():
	profileUrl = input('Please input the instagram profile link: ')
	
	urlList = PROFILE().Main(profileUrl)
	
	for url in urlList:
		POST().Main(url)

Explanation: After the browser instance is constructed and the Instagram account is logged in, the profile link profileUrl is read from input. PROFILE().Main(profileUrl) scrapes the links of all posts on the profile page and produces the link list urlList; POST().Main(url) then downloads the information, pictures, and videos of each post.

1. Constructing the Browser Instance

from selenium import webdriver

socksPort = 1080  # example value; set this to your SOCKS proxy's port
httpPort = 1081  # example value; set this to your HTTP proxy's port
sslPort = 1081  # example value; set this to your SSL proxy's port

fxProfile = webdriver.firefox.firefox_profile.FirefoxProfile()
fxProfile.set_preference('network.proxy.type', 1)
fxProfile.set_preference('network.proxy.socks', '127.0.0.1')
fxProfile.set_preference('network.proxy.socks_port', socksPort)
fxProfile.set_preference('network.proxy.http', '127.0.0.1')
fxProfile.set_preference('network.proxy.http_port', httpPort)
fxProfile.set_preference('network.proxy.ssl', '127.0.0.1')
fxProfile.set_preference('network.proxy.ssl_port', sslPort)
fxProfile.set_preference('network.proxy.socks_remote_dns', True)
fxProfile.set_preference('network.trr.mode', 2)
fxProfile.set_preference('permissions.default.image', 2)
fxProfile.set_preference('intl.accept_languages', 'zh-CN, zh, zh-TW, zh-HK, en-US, en')
fxBinaryPath = ''
geckodriverPath = ''
fxDriver = webdriver.firefox.webdriver.WebDriver(firefox_profile=fxProfile, firefox_binary=fxBinaryPath, executable_path=geckodriverPath)

Explanation: See an earlier blog post for background. Set the SOCKS, HTTP, and SSL proxy ports socksPort, httpPort, and sslPort according to the tools you have at hand. fxBinaryPath is the absolute path of Firefox's firefox.exe, and geckodriverPath is the absolute path of geckodriver.exe.

2. Logging in to an Instagram Account

account = ''
password = ''
fxDriver.get('https://www.instagram.com/accounts/login/')
webdriver.support.ui.WebDriverWait(fxDriver, 100).until(lambda x: x.find_element_by_name('username'))
fxDriver.find_element_by_name('username').send_keys(account)
fxDriver.find_element_by_name('password').send_keys(password)
fxDriver.find_element_by_xpath('//button[@type="submit"]').click()

Explanation: account is your Instagram username and password is your Instagram password.

3. Scraping the Links of All Posts on the Profile Page

The basic structure of this step is as follows:

def Main(self, profileUrl):
	try:
		fxDriver.get(profileUrl)
		urlList = PROFILE().GetWholePage()
		return urlList
	except Exception as e:
		print(e)

Explanation: The browser first visits the profile page; PROFILE().GetWholePage() then scrapes the links of all posts on the page and produces the link list urlList.


PROFILE().GetWholePage() is as follows:

def GetWholePage(self):
	locY, urlList = PROFILE().GetLocY()
	loadFailCount = 0
	
	while 1:
		pageDownJS = 'document.documentElement.scrollTop=100000000'
		fxDriver.execute_script(pageDownJS)
		
		while 1:
			locYNew, urlListNew = PROFILE().JudgeLoading(locY, urlList)
			
			if locYNew == None:
				loadFailCount += 1
				if loadFailCount > 20:
					return urlList
			else:
				loadFailCount = 0
				locY = locYNew
				urlList = urlListNew
				break

Explanation:

  1. PROFILE().GetLocY() obtains the Y coordinate locY of the tag holding the last post link in the profile page's HTML, together with the list urlList of all post links loaded so far.
  2. fxDriver.execute_script(pageDownJS) scrolls the page to the very bottom by executing the JS snippet document.documentElement.scrollTop=100000000.
  3. PROFILE().JudgeLoading(locY, urlList) compares the input Y coordinate with the Y coordinate measured 0.5 seconds later to judge whether the scroll triggered by fxDriver.execute_script(pageDownJS) has loaded new content. If it has not, it returns None; if it has, it returns the new Y coordinate locYNew and the new link list urlListNew.
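The scroll-and-check loop above can be illustrated without a browser. The sketch below uses a hypothetical FakePage stand-in whose scroll() reveals one more batch of post links per call; collect_all_links mirrors the GetWholePage() logic (scroll, compare Y, reset or count failures), except that it dedupes with dict.fromkeys to keep link order, where the original's list(set(...)) does not.

```python
# A browser-free sketch of the scroll-until-stable loop in GetWholePage().
# FakePage is a made-up stand-in for the profile page, not part of the project.

class FakePage:
    def __init__(self, batches):
        self.batches = batches            # link batches revealed per scroll
        self.loaded = list(batches[0])    # links visible before any scrolling
        self.y = len(self.loaded)         # stand-in for the last link's Y coordinate
        self.next_batch = 1

    def scroll(self):
        if self.next_batch < len(self.batches):
            self.loaded += self.batches[self.next_batch]
            self.next_batch += 1
            self.y = len(self.loaded)

def collect_all_links(page, max_failures=3):
    urls, loc_y = list(page.loaded), page.y
    failures = 0
    while failures <= max_failures:
        page.scroll()
        if page.y > loc_y:                # new content appeared: keep it, reset counter
            loc_y = page.y
            urls = list(dict.fromkeys(urls + page.loaded))  # dedupe, keep order
            failures = 0
        else:                             # nothing new loaded: count a failure
            failures += 1
    return urls

page = FakePage([['/p/a/', '/p/b/'], ['/p/c/'], ['/p/d/']])
result = collect_all_links(page)
print(result)
```

The failure counter plays the same role as loadFailCount in the original: the loop only gives up after several consecutive scrolls produce nothing new, which tolerates slow loading.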

PROFILE().GetLocY() is as follows:

def GetLocY(self):
	urlList = []
	
	for e in fxDriver.find_elements_by_tag_name('a'):
		try:
			url = e.get_attribute('href')
			if '/p/' in url:
				locY = e.location['y']
				urlList.append(url)
		except:
			continue
	
	return locY, urlList

Explanation: The loop checks whether '/p/' appears in each a tag's href attribute; matching hrefs are collected as post links, along with the Y coordinate of the tag that holds them.
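The '/p/' filter can be seen in isolation on a hypothetical list of href values (the URLs below are made up). A None href raises TypeError inside the membership test, which the try/except skips, mirroring why GetLocY() wraps its loop body in try/except:

```python
# Filtering post links out of a list of href values, with no browser involved.
hrefs = [
    'https://www.instagram.com/explore/',
    'https://www.instagram.com/p/ABC123/',
    None,                                 # some <a> tags carry no href
    'https://www.instagram.com/p/XYZ789/',
]

url_list = []
for url in hrefs:
    try:
        if '/p/' in url:                  # TypeError when url is None
            url_list.append(url)
    except TypeError:
        continue

print(url_list)
```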


PROFILE().JudgeLoading(locY, urlList) is as follows:

def JudgeLoading(self, locY, urlList):
	time.sleep(0.5)
	
	locYNew, urlListNew = PROFILE().GetLocY()
	
	if locY < locYNew:
		urlListNew += urlList
		urlListNew = list(set(urlListNew))
		return locYNew, urlListNew
	else:
		return None, None

Integrate the modules above into a class as follows:

class PROFILE(object):
	
	def GetLocY(self):
		urlList = []
		
		for e in fxDriver.find_elements_by_tag_name('a'):
			try:
				url = e.get_attribute('href')
				if '/p/' in url:
					locY = e.location['y']
					urlList.append(url)
			except:
				continue
		
		return locY, urlList
	
	def JudgeLoading(self, locY, urlList):
		time.sleep(0.5)
		
		locYNew, urlListNew = PROFILE().GetLocY()
		
		if locY < locYNew:
			urlListNew += urlList
			urlListNew = list(set(urlListNew))
			return locYNew, urlListNew
		else:
			return None, None
	
	def GetWholePage(self):
		locY, urlList = PROFILE().GetLocY()
		loadFailCount = 0
		
		while 1:
			pageDownJS = 'document.documentElement.scrollTop=100000000'
			fxDriver.execute_script(pageDownJS)
			
			while 1:
				locYNew, urlListNew = PROFILE().JudgeLoading(locY, urlList)
				
				if locYNew == None:
					loadFailCount += 1
					if loadFailCount > 20:
						return urlList
				else:
					loadFailCount = 0
					locY = locYNew
					urlList = urlListNew
					break
	
	def Main(self, profileUrl):
		try:
			fxDriver.get(profileUrl)
			urlList = PROFILE().GetWholePage()
			return urlList
		except Exception as e:
			print(e)

Explanation: Calling PROFILE().Main(profileUrl) returns the links of all posts on the profile page.

4. Scraping and Downloading the Information, Pictures, and Videos on Each Post Page

The basic structure of this step is as follows:

def Main(self, url):
	try:
		fxDriver.get(url)
		html = fxDriver.page_source
		info = POST().GetInfo(html)
		POST().DownloadInfo(info)
		POST().DownloadFile(info)
	except Exception as e:
		print(e)

Explanation: The browser first visits the post page; fxDriver.page_source then returns the page's HTML. POST().GetInfo(html) parses the HTML to obtain the upload time (in the local time zone), username, full name, post text, likes, comments, picture descriptions and links (when the post contains pictures), and video view count and link (when the post contains a video). POST().DownloadInfo(info) writes this information to a text file, and POST().DownloadFile(info) downloads the post's pictures and videos based on it.


POST().GetInfo(html) is as follows:

def GetInfo(self, html):
	soup = BeautifulSoup(html, 'html.parser')
	for s in soup.find_all('script', {'type':'text/javascript'}):
		if s.string is not None and 'graphql' in s.string:
			jsonData = json.loads(s.string[s.string.find('{'): s.string.rfind('}') + 1])
			break
	
	uploadTimeStamp = jsonData['graphql']['shortcode_media']['taken_at_timestamp']
	uploadTime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(uploadTimeStamp))
	username = jsonData['graphql']['shortcode_media']['owner']['username']
	fullName = jsonData['graphql']['shortcode_media']['owner']['full_name']
	likes = jsonData['graphql']['shortcode_media']['edge_media_preview_like']['count']
	comments = jsonData['graphql']['shortcode_media']['edge_media_preview_comment']['count']
	try:
		text = jsonData['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
	except:
		text = 'None'
	
	try:
		displayDict = {}
		for obj in jsonData['graphql']['shortcode_media']['edge_sidecar_to_children']['edges']:
			displayUrl = obj['node']['display_url']
			picDescription = obj['node']['accessibility_caption']
			displayDict[displayUrl] = picDescription
		return uploadTime, username, fullName, likes, comments, text, displayDict, 'ps'
	except:
		try:
			videoUrl = jsonData['graphql']['shortcode_media']['video_url']
			videoViewCount = jsonData['graphql']['shortcode_media']['video_view_count']
			return uploadTime, username, fullName, likes, comments, text, videoUrl, videoViewCount, 'v'
		except:
			displayUrl = jsonData['graphql']['shortcode_media']['display_url']
			picDescription = jsonData['graphql']['shortcode_media']['accessibility_caption']
			return uploadTime, username, fullName, likes, comments, text, displayUrl, picDescription, 'p'

Explanation: All the post information we need is contained in jsonData. We obtain jsonData by checking whether 'graphql' appears in a script tag whose type attribute is 'text/javascript'.
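The slicing trick that produces jsonData works on any script text that embeds a single JSON object: take everything from the first '{' to the last '}'. Below is a stdlib-only sketch on a made-up, heavily trimmed stand-in for the real script content (the real payload is much larger and its wrapper function name differs):

```python
import json, time

# script_text is a hypothetical stand-in for a script tag's contents.
script_text = ('window.__exampleLoader("/p/ABC123/", '
               '{"graphql": {"shortcode_media": {"taken_at_timestamp": 1589704205, '
               '"owner": {"username": "someone"}}}});')

# Slice from the first '{' to the last '}' and parse, exactly as GetInfo() does.
json_data = json.loads(script_text[script_text.find('{'): script_text.rfind('}') + 1])
media = json_data['graphql']['shortcode_media']
print(media['owner']['username'])

# The timestamp converts the same way as in GetInfo(); gmtime is used here so
# the result does not depend on the machine's local time zone.
print(time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(media['taken_at_timestamp'])))
```

Note the slice relies on the JSON object being the last '}' in the script; that holds for the page payload described here, but a stricter parser would be needed if other braces followed it.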


POST().DownloadInfo(info) is as follows:

def DownloadInfo(self, info):
	now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(int(time.time())))
	uploadTime = info[0]
	username = info[1]
	fullName = info[2]
	likes = info[3]
	comments = info[4]
	text = info[5]
	folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
	
	try:
		os.makedirs(folder)
	except Exception as e:
		print(e)
	
	with open('{}\\info.txt'.format(folder), 'w', encoding='utf-8') as f:
		f.write('Now: {}'.format(now))
		f.write('\nUpload time: {}'.format(uploadTime))
		f.write('\nUsername: {}'.format(username))
		f.write('\nFull name: {}'.format(fullName))
		f.write('\nText: {}'.format(text))
		f.write('\nLikes: {}'.format(likes))
		f.write('\nComments: {}'.format(comments))
		
		if info[-1] == 'ps':
			displayDict = info[6]
			picIdx = 1
			for displayUrl, picDescription in displayDict.items():
				f.write('\nPicture {} description: {}'.format(str(picIdx), picDescription))
				f.write('\nPicture {} url: {}'.format(str(picIdx), displayUrl))
				picIdx += 1
		elif info[-1] == 'v':
			videoUrl = info[6]
			videoViewCount = info[7]
			f.write('\nVideo view count: {}'.format(videoViewCount))
			f.write('\nVideo url: {}'.format(videoUrl))
		elif info[-1] == 'p':
			displayUrl = info[6]
			picDescription = info[7]
			f.write('\nPicture description: {}'.format(picDescription))
			f.write('\nPicture url: {}'.format(displayUrl))

Explanation: outputPath is a global variable holding the absolute path of the output folder.
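The per-post folder name is derived from the formatted upload time by stripping '-', ':', and ' ', leaving a compact YYYYMMDDHHMMSS string. A small sketch (the values are made-up examples):

```python
import os

upload_time = '2020-05-17 08:30:05'   # example value in DownloadInfo()'s format
folder_name = upload_time.replace('-', '').replace(':', '').replace(' ', '')
print(folder_name)  # 20200517083005

# The original joins path parts with hard-coded '\\', which is Windows-only;
# os.path.join is a portable alternative. 'output' and 'someone' are placeholders.
folder = os.path.join('output', 'someone', folder_name)
```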


POST().DownloadFile(info) is as follows:

def DownloadFile(self, info):
	uploadTime = info[0]
	username = info[1]
	folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
	
	if info[-1] == 'ps':
		displayDict = info[6]
		i = 1
		for displayUrl in displayDict.keys():
			os.system('{} --output-document={}\\{}.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, str(i), httpProxy, httpsProxy, displayUrl))
			i += 1
	elif info[-1] == 'v':
		videoUrl = info[6]
		os.system('{} --output-document={}\\1.mp4 --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, videoUrl))
	elif info[-1] == 'p':
		displayUrl = info[6]
		os.system('{} --output-document={}\\1.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, displayUrl))

Explanation: See an earlier blog post for background. wgetPath is a global variable holding the absolute path of wget.exe; httpProxy and httpsProxy are global variables holding the HTTP and HTTPS proxy addresses.
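Building the wget invocation as an argument list and running it with subprocess.run avoids the shell-quoting pitfalls of os.system() when URLs contain special characters. The sketch below mirrors the flags of the command above; the wget path, proxies, and URL are placeholders:

```python
import subprocess

def build_wget_cmd(wget_path, out_file, http_proxy, https_proxy, url):
    # Argument list matching the os.system() command used by DownloadFile().
    return [
        wget_path,
        '--output-document={}'.format(out_file),
        '--no-check-certificate',
        '--execute', 'http_proxy={}'.format(http_proxy),
        '--execute', 'https_proxy={}'.format(https_proxy),
        '--execute', 'robots=off',
        '--continue',
        url,
    ]

cmd = build_wget_cmd('wget', '1.png',
                     'http://127.0.0.1:1081/', 'https://127.0.0.1:1081/',
                     'https://example.com/pic')
print(cmd)
# subprocess.run(cmd)  # uncomment to perform an actual download
```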


Integrate the modules above into a class as follows:

class POST(object):
	
	def GetInfo(self, html):
		soup = BeautifulSoup(html, 'html.parser')
		for s in soup.find_all('script', {'type':'text/javascript'}):
			if s.string is not None and 'graphql' in s.string:
				jsonData = json.loads(s.string[s.string.find('{'): s.string.rfind('}') + 1])
				break
		
		uploadTimeStamp = jsonData['graphql']['shortcode_media']['taken_at_timestamp']
		uploadTime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(uploadTimeStamp))
		username = jsonData['graphql']['shortcode_media']['owner']['username']
		fullName = jsonData['graphql']['shortcode_media']['owner']['full_name']
		likes = jsonData['graphql']['shortcode_media']['edge_media_preview_like']['count']
		comments = jsonData['graphql']['shortcode_media']['edge_media_preview_comment']['count']
		try:
			text = jsonData['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
		except:
			text = 'None'
		
		try:
			displayDict = {}
			for obj in jsonData['graphql']['shortcode_media']['edge_sidecar_to_children']['edges']:
				displayUrl = obj['node']['display_url']
				picDescription = obj['node']['accessibility_caption']
				displayDict[displayUrl] = picDescription
			return uploadTime, username, fullName, likes, comments, text, displayDict, 'ps'
		except:
			try:
				videoUrl = jsonData['graphql']['shortcode_media']['video_url']
				videoViewCount = jsonData['graphql']['shortcode_media']['video_view_count']
				return uploadTime, username, fullName, likes, comments, text, videoUrl, videoViewCount, 'v'
			except:
				displayUrl = jsonData['graphql']['shortcode_media']['display_url']
				picDescription = jsonData['graphql']['shortcode_media']['accessibility_caption']
				return uploadTime, username, fullName, likes, comments, text, displayUrl, picDescription, 'p'
	
	def DownloadInfo(self, info):
		now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(int(time.time())))
		uploadTime = info[0]
		username = info[1]
		fullName = info[2]
		likes = info[3]
		comments = info[4]
		text = info[5]
		folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
		
		try:
			os.makedirs(folder)
		except Exception as e:
			print(e)
		
		with open('{}\\info.txt'.format(folder), 'w', encoding='utf-8') as f:
			f.write('Now: {}'.format(now))
			f.write('\nUpload time: {}'.format(uploadTime))
			f.write('\nUsername: {}'.format(username))
			f.write('\nFull name: {}'.format(fullName))
			f.write('\nText: {}'.format(text))
			f.write('\nLikes: {}'.format(likes))
			f.write('\nComments: {}'.format(comments))
			
			if info[-1] == 'ps':
				displayDict = info[6]
				picIdx = 1
				for displayUrl, picDescription in displayDict.items():
					f.write('\nPicture {} description: {}'.format(str(picIdx), picDescription))
					f.write('\nPicture {} url: {}'.format(str(picIdx), displayUrl))
					picIdx += 1
			elif info[-1] == 'v':
				videoUrl = info[6]
				videoViewCount = info[7]
				f.write('\nVideo view count: {}'.format(videoViewCount))
				f.write('\nVideo url: {}'.format(videoUrl))
			elif info[-1] == 'p':
				displayUrl = info[6]
				picDescription = info[7]
				f.write('\nPicture description: {}'.format(picDescription))
				f.write('\nPicture url: {}'.format(displayUrl))
	
	def DownloadFile(self, info):
		uploadTime = info[0]
		username = info[1]
		folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
		
		if info[-1] == 'ps':
			displayDict = info[6]
			i = 1
			for displayUrl in displayDict.keys():
				os.system('{} --output-document={}\\{}.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, str(i), httpProxy, httpsProxy, displayUrl))
				i += 1
		elif info[-1] == 'v':
			videoUrl = info[6]
			os.system('{} --output-document={}\\1.mp4 --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, videoUrl))
		elif info[-1] == 'p':
			displayUrl = info[6]
			os.system('{} --output-document={}\\1.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, displayUrl))
	
	def Main(self, url):
		try:
			fxDriver.get(url)
			html = fxDriver.page_source
			info = POST().GetInfo(html)
			POST().DownloadInfo(info)
			POST().DownloadFile(info)
		except Exception as e:
			print(e)

Explanation: Calling POST().Main(url) downloads the information, pictures, and videos of a single post.

5. Complete Code

from selenium import webdriver
from bs4 import BeautifulSoup
import json, time, os

socksPort = 1080  # example value; set this to your SOCKS proxy's port
httpPort = 1081  # example value; set this to your HTTP proxy's port
sslPort = 1081  # example value; set this to your SSL proxy's port

fxProfile = webdriver.firefox.firefox_profile.FirefoxProfile()
fxProfile.set_preference('network.proxy.type', 1)
fxProfile.set_preference('network.proxy.socks', '127.0.0.1')
fxProfile.set_preference('network.proxy.socks_port', socksPort)
fxProfile.set_preference('network.proxy.http', '127.0.0.1')
fxProfile.set_preference('network.proxy.http_port', httpPort)
fxProfile.set_preference('network.proxy.ssl', '127.0.0.1')
fxProfile.set_preference('network.proxy.ssl_port', sslPort)
fxProfile.set_preference('network.proxy.socks_remote_dns', True)
fxProfile.set_preference('network.trr.mode', 2)
fxProfile.set_preference('permissions.default.image', 2)
fxProfile.set_preference('intl.accept_languages', 'zh-CN, zh, zh-TW, zh-HK, en-US, en')
fxBinaryPath = ''
geckodriverPath = ''
fxDriver = webdriver.firefox.webdriver.WebDriver(firefox_profile=fxProfile, firefox_binary=fxBinaryPath, executable_path=geckodriverPath)

account = ''
password = ''
fxDriver.get('https://www.instagram.com/accounts/login/')
webdriver.support.ui.WebDriverWait(fxDriver, 100).until(lambda x: x.find_element_by_name('username'))
fxDriver.find_element_by_name('username').send_keys(account)
fxDriver.find_element_by_name('password').send_keys(password)
fxDriver.find_element_by_xpath('//button[@type="submit"]').click()

outputPath = ''
wgetPath = ''
httpProxy = 'http://127.0.0.1:{}/'.format(str(httpPort))
httpsProxy = 'https://127.0.0.1:{}/'.format(str(sslPort))

class POST(object):
	
	def GetInfo(self, html):
		soup = BeautifulSoup(html, 'html.parser')
		for s in soup.find_all('script', {'type':'text/javascript'}):
			if s.string is not None and 'graphql' in s.string:
				jsonData = json.loads(s.string[s.string.find('{'): s.string.rfind('}') + 1])
				break
		
		uploadTimeStamp = jsonData['graphql']['shortcode_media']['taken_at_timestamp']
		uploadTime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(uploadTimeStamp))
		username = jsonData['graphql']['shortcode_media']['owner']['username']
		fullName = jsonData['graphql']['shortcode_media']['owner']['full_name']
		likes = jsonData['graphql']['shortcode_media']['edge_media_preview_like']['count']
		comments = jsonData['graphql']['shortcode_media']['edge_media_preview_comment']['count']
		try:
			text = jsonData['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
		except:
			text = 'None'
		
		try:
			displayDict = {}
			for obj in jsonData['graphql']['shortcode_media']['edge_sidecar_to_children']['edges']:
				displayUrl = obj['node']['display_url']
				picDescription = obj['node']['accessibility_caption']
				displayDict[displayUrl] = picDescription
			return uploadTime, username, fullName, likes, comments, text, displayDict, 'ps'
		except:
			try:
				videoUrl = jsonData['graphql']['shortcode_media']['video_url']
				videoViewCount = jsonData['graphql']['shortcode_media']['video_view_count']
				return uploadTime, username, fullName, likes, comments, text, videoUrl, videoViewCount, 'v'
			except:
				displayUrl = jsonData['graphql']['shortcode_media']['display_url']
				picDescription = jsonData['graphql']['shortcode_media']['accessibility_caption']
				return uploadTime, username, fullName, likes, comments, text, displayUrl, picDescription, 'p'
	
	def DownloadInfo(self, info):
		now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(int(time.time())))
		uploadTime = info[0]
		username = info[1]
		fullName = info[2]
		likes = info[3]
		comments = info[4]
		text = info[5]
		folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
		
		try:
			os.makedirs(folder)
		except Exception as e:
			print(e)
		
		with open('{}\\info.txt'.format(folder), 'w', encoding='utf-8') as f:
			f.write('Now: {}'.format(now))
			f.write('\nUpload time: {}'.format(uploadTime))
			f.write('\nUsername: {}'.format(username))
			f.write('\nFull name: {}'.format(fullName))
			f.write('\nText: {}'.format(text))
			f.write('\nLikes: {}'.format(likes))
			f.write('\nComments: {}'.format(comments))
			
			if info[-1] == 'ps':
				displayDict = info[6]
				picIdx = 1
				for displayUrl, picDescription in displayDict.items():
					f.write('\nPicture {} description: {}'.format(str(picIdx), picDescription))
					f.write('\nPicture {} url: {}'.format(str(picIdx), displayUrl))
					picIdx += 1
			elif info[-1] == 'v':
				videoUrl = info[6]
				videoViewCount = info[7]
				f.write('\nVideo view count: {}'.format(videoViewCount))
				f.write('\nVideo url: {}'.format(videoUrl))
			elif info[-1] == 'p':
				displayUrl = info[6]
				picDescription = info[7]
				f.write('\nPicture description: {}'.format(picDescription))
				f.write('\nPicture url: {}'.format(displayUrl))
	
	def DownloadFile(self, info):
		uploadTime = info[0]
		username = info[1]
		folder = '{}\\{}\\{}'.format(outputPath, username, uploadTime.replace('-', '').replace(':', '').replace(' ', ''))
		
		if info[-1] == 'ps':
			displayDict = info[6]
			i = 1
			for displayUrl in displayDict.keys():
				os.system('{} --output-document={}\\{}.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, str(i), httpProxy, httpsProxy, displayUrl))
				i += 1
		elif info[-1] == 'v':
			videoUrl = info[6]
			os.system('{} --output-document={}\\1.mp4 --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, videoUrl))
		elif info[-1] == 'p':
			displayUrl = info[6]
			os.system('{} --output-document={}\\1.png --no-check-certificate --execute http_proxy={} --execute https_proxy={} --execute robots=off --continue "{}"'.format(wgetPath, folder, httpProxy, httpsProxy, displayUrl))
	
	def Main(self, url):
		try:
			fxDriver.get(url)
			html = fxDriver.page_source
			info = POST().GetInfo(html)
			POST().DownloadInfo(info)
			POST().DownloadFile(info)
		except Exception as e:
			print(e)

class PROFILE(object):
	
	def GetLocY(self):
		urlList = []
		
		for e in fxDriver.find_elements_by_tag_name('a'):
			try:
				url = e.get_attribute('href')
				if '/p/' in url:
					locY = e.location['y']
					urlList.append(url)
			except:
				continue
		
		return locY, urlList
	
	def JudgeLoading(self, locY, urlList):
		time.sleep(0.5)
		
		locYNew, urlListNew = PROFILE().GetLocY()
		
		if locY < locYNew:
			urlListNew += urlList
			urlListNew = list(set(urlListNew))
			return locYNew, urlListNew
		else:
			return None, None
	
	def GetWholePage(self):
		locY, urlList = PROFILE().GetLocY()
		loadFailCount = 0
		
		while 1:
			pageDownJS = 'document.documentElement.scrollTop=100000000'
			fxDriver.execute_script(pageDownJS)
			
			while 1:
				locYNew, urlListNew = PROFILE().JudgeLoading(locY, urlList)
				
				if locYNew == None:
					loadFailCount += 1
					if loadFailCount > 20:
						return urlList
				else:
					loadFailCount = 0
					locY = locYNew
					urlList = urlListNew
					break
	
	def Main(self, profileUrl):
		try:
			fxDriver.get(profileUrl)
			urlList = PROFILE().GetWholePage()
			return urlList
		except Exception as e:
			print(e)

def Main():
	profileUrl = input('Please input the instagram profile link: ')
	
	urlList = PROFILE().Main(profileUrl)
	
	for url in urlList:
		POST().Main(url)
	
	Main()  # restart and prompt for the next profile link

if __name__ == '__main__':
	Main()