Preface
My English is not great, and I wanted to be able to read a Chinese view of all the competitions offline, so I had no choice but to write a crawler to download them and organize them locally.
Development plan
- 1. Be able to run Google translation so the results can be read offline
- 2. Provide a preview of the key data on each detail page
- 3. Use an HTML-parsing tool to produce Markdown for easy previewing
- 4. Store the data in a database
- 5. For each competition, crawl and record the top-voted kernels / discussions
Current progress
1. Translation through the Google Translate API while simulating a browser
2. Scraping the competition data from the Kaggle homepage
The above took me an entire day. It looks simple written down, but wiring the functions together and debugging took a great deal of effort, especially the Google Translate part, for which there is almost no mature version online.
Google Translate code snippet
Difficulties
- 1. Packet capture shows that every Google Translate request carries a tk parameter. This tk is generated dynamically; following versions found online, I compile the JS that produces it with the pyexecjs package.
- 2. Parsing the result returned for the URL; note that the answer is result[4:end].
- 3. open_url is fairly simple, and simulating a browser is not really a difficulty either; details omitted.
# -*- coding: utf-8 -*-
import execjs


class GoogleTranslaterTk():
    def __init__(self):
        # Compile the token-generation JS (recovered from Google's page) once.
        self.ctx = execjs.compile("""
        function TL(a) {
            var k = "";
            var b = 406644;
            var b1 = 3293161072;
            var jd = ".";
            var $b = "+-a^+6";
            var Zb = "+-3^+b+-f";
            for (var e = [], f = 0, g = 0; g < a.length; g++) {
                var m = a.charCodeAt(g);
                128 > m ? e[f++] = m : (2048 > m ? e[f++] = m >> 6 | 192 : (55296 == (m & 64512) && g + 1 < a.length && 56320 == (a.charCodeAt(g + 1) & 64512) ? (m = 65536 + ((m & 1023) << 10) + (a.charCodeAt(++g) & 1023),
                e[f++] = m >> 18 | 240,
                e[f++] = m >> 12 & 63 | 128) : e[f++] = m >> 12 | 224,
                e[f++] = m >> 6 & 63 | 128),
                e[f++] = m & 63 | 128)
            }
            a = b;
            for (f = 0; f < e.length; f++) a += e[f],
            a = RL(a, $b);
            a = RL(a, Zb);
            a ^= b1 || 0;
            0 > a && (a = (a & 2147483647) + 2147483648);
            a %= 1E6;
            return a.toString() + jd + (a ^ b)
        };
        function RL(a, b) {
            var t = "a";
            var Yb = "+";
            for (var c = 0; c < b.length - 2; c += 3) {
                var d = b.charAt(c + 2),
                d = d >= t ? d.charCodeAt(0) - 87 : Number(d),
                d = b.charAt(c + 1) == Yb ? a >>> d : a << d;
                a = b.charAt(c) == Yb ? a + d & 4294967295 : a ^ d
            }
            return a
        }
        """)

    def getTk(self, text):
        return self.ctx.call("TL", text)
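If no JavaScript runtime is available for pyexecjs, the same TL/RL algorithm can be ported to pure Python. The sketch below mirrors the JS above, including its 32-bit overflow behaviour. The seed constants 406644 / 3293161072 are the ones embedded in the captured JS and were rotated by Google over time, so treat this as an illustration rather than a drop-in replacement; the function names `_int32`, `_rl`, and `compute_tk` are my own.

```python
def _int32(n):
    """Clamp a Python int to JS signed 32-bit integer semantics."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

def _rl(a, b):
    """Python port of the JS RL() shift/mix helper."""
    for c in range(0, len(b) - 2, 3):
        ch = b[c + 2]
        d = ord(ch) - 87 if ch >= 'a' else int(ch)
        # '+' in the middle slot means unsigned right shift (JS >>>), else left shift
        d = (a & 0xFFFFFFFF) >> d if b[c + 1] == '+' else _int32(a << d)
        # '+' in the first slot means 32-bit add, else XOR
        a = _int32(a + d) if b[c] == '+' else _int32(a ^ d)
    return a

def compute_tk(text, b=406644, b1=3293161072):
    """Python port of the JS TL(): UTF-8 encode, mix, and format the tk token."""
    a = b
    for byte in text.encode('utf-8'):  # matches the JS manual UTF-8 encoder
        a = _rl(a + byte, '+-a^+6')
    a = _rl(a, '+-3^+b+-f')
    a = _int32(a ^ b1)
    if a < 0:
        a = (a & 0x7FFFFFFF) + 0x80000000
    a %= 1000000
    return '{}.{}'.format(a, a ^ b)
```

The output has the same `NNNNNN.NNNNNN` shape as getTk; I have not verified byte-for-byte equality against the JS for every input, so it is a starting point, not a guarantee.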
import urllib.request


def open_url(url):  # fetch a page while pretending to be a browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=url, headers=headers)  # Python 2: urllib2.Request()
    response = urllib.request.urlopen(req)  # Python 2: urllib2.urlopen()
    data = response.read().decode('utf-8')
    return data
import urllib.parse


def translate(content, tk):
    if len(content) > 4891:  # could also wrap the request in a try instead
        print("Content too long to translate; please split it first")
        return
    content = urllib.parse.quote(content)
    url = "http://translate.google.cn/translate_a/single?client=t" \
          "&sl=en&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca" \
          "&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&clearbtn=1&otf=1&pc=1" \
          "&srcrom=0&ssel=0&tsel=0&kc=2&tk={}&q={}".format(tk, content)
    result = open_url(url)
    end = result.find("\",")
    if end > 4:
        return result[4:end]  # the translation sits between result[4] and the first '",'


def tranEn2Cn(content):
    js = GoogleTranslaterTk()
    return translate(content, js.getTk(content))
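The slice `result[4:end]` relies on the translation being the very first string in the response. Since the endpoint actually returns a JSON array, parsing it as JSON is more robust. The sample response below is a hand-made illustration of the shape, not a captured live reply; real responses sometimes contain empty slots (`,,`) that strict `json.loads` rejects and would need patching first.

```python
import json

def extract_translation(result):
    """Pull the translated text out of a translate_a/single JSON response.

    The first element is a list of [translated, original, ...] segments;
    joining them handles multi-sentence input.
    """
    parsed = json.loads(result)
    return ''.join(seg[0] for seg in parsed[0] if seg and seg[0])

# Illustrative response shape (two translated segments):
sample = '[[["你好。","Hello. ",null,null,1],["世界!","World!",null,null,1]],null,"en"]'
```

Here `extract_translation(sample)` yields `你好。世界!`, whereas the slicing approach would stop at the first segment.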
test.py
from translate_goole_sy import tranEn2Cn

print(tranEn2Cn("what are you want to do?!"))
def test2(string):
    ls = string.split('\n')
    with open('d:\\txt', 'w+') as f:
        for i in ls:
            if i:  # skip empty lines; the original `if(not None)` was always true
                f.write(tranEn2Cn(i))
    # no explicit f.close() needed: the with block closes the file
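Since translate() refuses input over 4891 characters, longer documents need splitting before translation. A minimal sketch that packs whole lines into chunks under the limit (the 4891 value is taken from the check in translate; the name `split_for_translate` is my own):

```python
def split_for_translate(text, limit=4891):
    """Greedily pack whole lines into chunks no longer than `limit`.

    A single line longer than `limit` is kept intact; translate() would
    still reject it, but we never silently cut in the middle of a line.
    """
    chunks, current = [], None
    for line in text.split('\n'):
        candidate = line if current is None else current + '\n' + line
        if current is not None and len(candidate) > limit:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current is not None:
        chunks.append(current)
    return chunks
```

Joining the chunks back with '\n' reproduces the original text, so the per-chunk translations can simply be concatenated in order.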
Kaggle homepage competition preview
Nothing particularly difficult here; it is mostly parsing the content by format, which still took quite a while.
- Non-greedy regex matching of the content, then writing the rows into a DataFrame.
- Note that the field ids below, and the columns list, could actually be generated dynamically.
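The second bullet, that the ids and columns could be generated dynamically, can be sketched by deriving both the non-greedy pattern and the column list from a single key list (shortened to three keys here for illustration; the sample string is made up, not real Kaggle data):

```python
import re

keys = ["competitionId", "competitionTitle", "deadline"]  # subset for illustration
# Build the same '{"key":(.*?),...}' non-greedy pattern the full code spells out by hand.
regex = "{" + ",".join('"%s":(.*?)' % k for k in keys) + "}"
columns = list(keys)  # the DataFrame columns fall out of the same list

sample = '{"competitionId":6565,"competitionTitle":"Titanic","deadline":"2030-01-01"}'
rows = re.findall(regex, sample)  # one tuple per match, one group per key
```

Regex over embedded JSON stays as fragile as in the hand-written version, but this way adding or removing a field is a one-line change.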
import urllib.request as ur


def open_url(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = ur.Request(url=url, headers=headers)  # Python 2: urllib2.Request()
    response = ur.urlopen(req)  # Python 2: urllib2.urlopen()
    return response.read().decode('utf-8')


'''Each competition appears in the page source as a link such as:
<a class="block-link__anchor" href="/c/"\
"intel-mobileodt-cervical-cancer-screening"></a>'''

import pandas as pd
import numpy as np
import re


def demo():
    # Non-greedy groups pull out each field of the embedded competition JSON.
    regex = '{"competitionId":(.*?),' \
            '"competitionTitle":(.*?),' \
            '"competitionDescription":(.*?),' \
            '"competitionUrl":(.*?),' \
            '"thumbnailImageUrl":(.*?),' \
            '"deadline":(.*?),' \
            '"totalTeams":(.*?),' \
            '"totalKernels":(.*?),' \
            '"rewardQuantity":(.*?),' \
            '"rewardTypeName":(.*?),' \
            '"organizationName":(.*?),' \
            '"organizationUrl":(.*?),' \
            '"hostSegment":(.*?),' \
            '"isLimited":(.*?),' \
            '"isPrivate":(.*?),' \
            '"isInClass":(.*?),' \
            '"userHasEntered":(.*?),' \
            '"rewardDisplay":(.*?)}'
    columns = ["competitionId", "competitionTitle", "competitionDescription",
               "competitionUrl", "thumbnailImageUrl", "deadline", "totalTeams", "totalKernels", "rewardQuantity",
               "rewardTypeName", "organizationName", "organizationUrl", "hostSegment", "isLimited", "isPrivate",
               "isInClass", "userHasEntered", "rewardDisplay"]
    ls = []
    for i_pageNum in np.arange(1, 15):
        url2 = "https://www.kaggle.com/competitions?sortBy=deadline&group=all&page=" + \
               str(i_pageNum) + "&segment=allCategories"
        data = open_url(url2)
        lis = re.findall(regex, data)
        for x in lis:
            ls.append(list(x))
    num = 1
    for i in ls:
        print("------", num, "--------")
        print(i)
        num += 1
    df = pd.DataFrame(ls, columns=columns)
    print(df)
    df.to_csv("C:\\Users\\actanble\\Desktop\\de.csv")


if __name__ == "__main__":
    demo()
Postscript
I am heading home today and will have no time for this over the next 3-4 days; I had meant to push through and finish it in two or three days...
Thinking it over, though, the official site's layout and reading experience are actually very convenient. The main point of pulling the data down is to allow some quick lookups, and that utility is not very high.