Scraping Dianping Ratings with Image-Based OCR

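It's the National Day holiday and I didn't go anywhere, because my wife had to work overtime and I kept her company.
In the evening she said she needed some rating data from Dianping. I figured a Scrapy request would handle it easily, so I agreed; it didn't seem hard.
It turned out not to be that simple. I'll leave aside the human-verification problem, which I don't think I can solve anyway; what caught my interest was how the ratings are displayed.
Dianping renders this part as images, positioned with CSS offsets.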

The usual selector-based extraction doesn't work here.
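
To illustrate why selectors come back empty, here is a minimal sketch. The markup and class names are made up for illustration and are not Dianping's actual HTML; the only point is that when each digit is drawn by shifting a shared background image, the elements carry no text for a selector to extract.

```python
from html.parser import HTMLParser

# Hypothetical markup: each digit is an empty <span> whose CSS class
# selects a background-position offset into one shared sprite image.
HTML = (
    '<span class="score">'
    '<span class="d d-7"></span><span class="d d-3"></span>'
    '</span>'
)


class TextCollector(HTMLParser):
    """Collects every text node found in the document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks).strip()


# The digits exist only as pixels, so text extraction yields nothing.
print(repr(visible_text(HTML)))
```
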
Instead I used tesseract to recognize the text in the image.
The rough workflow is below.

Fetching the page

Here Selenium visits the page and then takes a screenshot.
Code snippet:


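from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
# Fragment from the spider class: self and url_id come from the surrounding code
opt = Options()
opt.add_argument('--headless')
self.driver = webdriver.Chrome(executable_path='/Users/xiangc/bin/chromedriver', options=opt)
self.wait = WebDriverWait(self.driver, 10)
self.driver.get('http://www.dianping.com/shop/4227604')
self.driver.save_screenshot('image{}.png'.format(url_id))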
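
The snippet never shows where `url_id` comes from. One plausible way to derive it, assuming it is simply the last path segment of the shop URL, is sketched below. The helper names `shop_id_from_url` and `screenshot_name` are hypothetical and not part of the original spider.

```python
from urllib.parse import urlparse


def shop_id_from_url(url):
    """Return the last non-empty path segment of a shop URL."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    if not segments:
        raise ValueError('no path segment in URL: {!r}'.format(url))
    return segments[-1]


def screenshot_name(url):
    """Build the screenshot file name in the same pattern the snippet uses."""
    return 'image{}.png'.format(shop_id_from_url(url))


print(screenshot_name('http://www.dianping.com/shop/4227604'))
```
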

Page screenshot

Cropping the needed region

The snippet is below. The coordinates are hardcoded, I'm ashamed to say.


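from PIL import Image
im = Image.open('image{}.png'.format(url_id))  # the screenshot saved above
cropped_img = im.crop((239, 500, 239 + 780, 500 + 63))  # (left, upper, right, lower)
cropped_img.save('crop{}.png'.format(url_id))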
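
Hardcoded coordinates stop working as soon as the layout shifts. One alternative, which is not what the snippet above does, is to compute the box from the position and size that Selenium reports for the element (`element.location` and `element.size`). The helper `crop_box` and its `scale` parameter (screenshot pixels per CSS pixel, relevant on high-DPI displays) are assumptions made for this sketch.

```python
def crop_box(location, size, scale=1.0):
    """Turn an element's location/size dicts into a PIL-style crop box.

    location: {'x': ..., 'y': ...}
    size:     {'width': ..., 'height': ...}
    scale:    screenshot pixels per CSS pixel
    Returns (left, upper, right, lower) as integers.
    """
    left = location['x'] * scale
    upper = location['y'] * scale
    right = (location['x'] + size['width']) * scale
    lower = (location['y'] + size['height']) * scale
    return tuple(int(round(v)) for v in (left, upper, right, lower))


# Reproduces the hardcoded box used above when fed the same numbers.
print(crop_box({'x': 239, 'y': 500}, {'width': 780, 'height': 63}))
```

The resulting tuple can be passed straight to `im.crop(...)`.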

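Image preprocessing

The preprocessing steps are as follows:

Remove noise: a point with only one non-white point around it is treated as noise and removed
Binarize what remains: any point with a colour value above 200 is set to white, and everything else to black
Increase the image contrast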


import os

import numpy as np
from PIL import Image, ImageEnhance

WHITE = (255, 255, 255)
BLACK = (0, 0, 0)


def get_color(image, x, y):
    # Accept either a PIL Image or the pixel-access object returned by Image.load()
    if isinstance(image, Image.Image):
        r, g, b = image.getpixel((x, y))[:3]
    else:
        r, g, b = image[x, y][:3]
    return r, g, b


def is_noise(image, x, y):
    # Count white pixels in the 3x3 block centred on (x, y);
    # positions outside the image count as white
    w, h = image.size
    white_count = 0
    for i in range(x - 1, x + 2):
        for j in range(y - 1, y + 2):
            if not (0 <= i < w and 0 <= j < h) or get_color(image, i, j) == WHITE:
                white_count += 1
    # The centre is non-white when this is called, so 7 or more white
    # pixels means at most one non-white neighbour
    return white_count >= 7


def clear_noise(image, new_pixels):
    # Copy image into new_pixels, whitening noise pixels; return how many were cleared
    w, h = image.size
    clear_count = 0
    for i in range(w):
        for j in range(h):
            r, g, b = get_color(image, i, j)
            # r != g != b is a chained comparison (r != g and g != b),
            # so pure gray, black and white pixels are never treated as noise
            if r != g != b and is_noise(image, i, j):
                clear_count += 1
                new_pixels[i, j] = WHITE
            else:
                new_pixels[i, j] = (r, g, b)
    return clear_count


def clear_color(new_pixels, w, h):
    # Binarize: pixels with a mean channel value above 200 become white, the rest black
    for i in range(w):
        for j in range(h):
            r, g, b = get_color(new_pixels, i, j)
            if np.average((r, g, b)) > 200:
                new_pixels[i, j] = WHITE
            else:
                new_pixels[i, j] = BLACK


def pre_image(full_path):
    image = Image.open(full_path).convert('RGB')
    w, h = image.size
    new_image = Image.new('RGB', (w, h), 'white')
    new_pixels = new_image.load()

    # Repeat the noise pass on the working copy until nothing more is cleared
    clear_count = clear_noise(image, new_pixels)
    while clear_count > 0:
        clear_count = clear_noise(new_image, new_pixels)
        print(clear_count)
    clear_color(new_pixels, w, h)

    # Contrast enhancement
    enh_img = ImageEnhance.Contrast(new_image)
    contrast = 3
    image_contrasted = enh_img.enhance(contrast)

    dir_name = os.path.dirname(full_path)
    file_name = os.path.basename(full_path)
    new_file_path = os.path.join(dir_name, 'sharped' + file_name)
    image_contrasted.save(new_file_path)
    return new_file_path
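
To make the noise rule easier to see, here is a small pure-Python sketch that applies the same idea to a grid of 0s (white) and 1s (ink) instead of a real image. It is a simplified illustration, not the code above: it ignores colour entirely, and the names `white_neighbours` and `remove_specks` are made up for this example.

```python
WHITE, INK = 0, 1


def white_neighbours(grid, x, y):
    """Count white cells among the 8 neighbours; off-grid cells count as white."""
    h, w = len(grid), len(grid[0])
    count = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = x + dx, y + dy
            if not (0 <= nx < w and 0 <= ny < h) or grid[ny][nx] == WHITE:
                count += 1
    return count


def remove_specks(grid):
    """Repeatedly whiten ink cells that have at most one ink neighbour."""
    grid = [row[:] for row in grid]  # work on a copy
    changed = True
    while changed:
        changed = False
        for y, row in enumerate(grid):
            for x, cell in enumerate(row):
                if cell == INK and white_neighbours(grid, x, y) >= 7:
                    grid[y][x] = WHITE
                    changed = True
    return grid


sample = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],  # the lone 1 on the right is a speck
    [0, 0, 0, 0, 0],
]
for row in remove_specks(sample):
    print(row)
```

The 2x2 block survives because each of its cells has three ink neighbours, while the isolated cell is whitened. Note that under this rule a one-pixel-wide stroke would be eroded from its ends on each pass, so the rule suits thick glyphs better than thin ones.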

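Text recognition

Recognition is done with tesseract.
Note the character whitelist added here to improve accuracy.
chi is a recognition data set I trained myself, from 10 training samples.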


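import pytesseract

import imgutils  # assumed to be the module holding pre_image from the previous section

new_file_path = imgutils.pre_image('crop{}.png'.format(url_id))
result = pytesseract.image_to_string(
    image=new_file_path,
    lang='chi',
    config='--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789評論服務:費用設施環境條.元'
)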
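
The recognized string still has to be turned into numbers. The sketch below is one way to do that, under the assumption that the output contains label and value pairs such as `設施:8.9`. The labels are taken from the whitelist characters, but the layout of the real OCR output is not shown in this post, so both the sample text and the pattern are guesses.

```python
import re

# Labels assembled from the whitelist characters; the grouping into words is assumed.
LABELS = ('評論', '服務', '費用', '設施', '環境')
PAIR = re.compile(r'({})\s*[:：]?\s*(\d+(?:\.\d+)?)'.format('|'.join(LABELS)))


def parse_scores(text):
    """Map each recognized label to the number that follows it."""
    return {label: float(value) for label, value in PAIR.findall(text)}


sample = '設施:8.9 環境:8.8 服務:8.7'  # made-up text, not real OCR output
print(parse_scores(sample))
```
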

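Results

Passable, I'd say.

Training helper scripts

Below is a collection of helper scripts:

Generating box files
Batch image processing
Batch training that produces the trained data files
Batch image format conversion, PNG -> TIFF

They're all JS and Python scripts, and fairly simple.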
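
The scripts themselves are not reproduced in this post. As a rough idea of what the PNG to TIFF step could look like in Python with Pillow, here is a minimal sketch; the function names and the folder layout are hypothetical, and the real script may differ.

```python
from pathlib import Path


def tiff_path_for(png_path):
    """Return the .tiff path that sits next to a given .png file."""
    return Path(png_path).with_suffix('.tiff')


def convert_folder(folder):
    """Convert every *.png in folder to TIFF and return the new paths."""
    from PIL import Image  # Pillow; imported here so the path helper works without it

    written = []
    for png in sorted(Path(folder).glob('*.png')):
        target = tiff_path_for(png)
        with Image.open(png) as img:
            img.save(target, format='TIFF')
        written.append(target)
    return written
```

Calling `convert_folder('samples')` would write a `.tiff` file beside each `.png` in a `samples` directory.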

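Gitee link

I won't post the spider code; it's too ugly, and I don't have time to clean it up right now.
Because Python's comment syntax clashed with Markdown's code tags, I removed most of the comments; I trust you can still follow along.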
